Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote:
> On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > > 
> > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > > > > > +   code is buggy:
> > > > > > > > > > +
> > > > > > > > > > +   int a[2];
> > > > > > > > > > +   int index;
> > > > > > > > > > +   int force_zero_index = 1;
> > > > > > > > > > +
> > > > > > > > > > +   ...
> > > > > > > > > > +
> > > > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > > > +
> > > > > > > > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > > > > > +   which can result in misordering bugs.
> > > > > > > > > > +
> > > > > > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > > > > +
> > > > > > > > > > +   int a[2];
> > > > > > > > > > +   int index;
> > > > > > > > > > +   int flip_index = 0;
> > > > > > > > > > +
> > > > > > > > > > +   ...
> > > > > > > > > > +
> > > > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > > > +
> > > > > > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > > > > > +   are often compiled using branches.  And as before, although
> > > > > > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > > > +   after such branches, they can speculate loads, which can again
> > > > > > > > > > +   result in misordering bugs.
> > > > > > > > 
> > > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > > proposed,
> > > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > > (unless
> > > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > > values that
> > > > > > > > flip_index can have).
> > > > > > > 
> > > > > > > And I am OK with the value_dep_preserving type providing 
> > > > > > > more/better
> > > > > > > guarantees than we get by default from current compilers.
> > > > > > > 
> > > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > > dependency to be tracked through a comparison.)
> > > > > > 
> > > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > > comparison?
> > > > > 
> > > > > That should work well assuming that things like "if", "while",
> > > > > and "?:" conditions are happy to take a vdp.  This assumes that
> > > > > p->a only returns vdp if field "a" is declared vdp, otherwise we
> > > > > have vdps running wild through the program.  ;-)
> > > > > 
> > > > > The other thing that can happen is that a vdp can get handed off to
> > > > > another synchronization mechanism, for example, to reference counting:
> > > > > 
> > > > >   p = atomic_load_explicit(&gp, memory_order_consume);
> > > > >   if (do_something_with(p->a)) {
> > > > >   /* fast path protected by RCU. */
> > > > >   return 0;
> > > > >   }
> > > > >   if (atomic_inc_not_zero(&p->refcnt)) {
> > > > >   /* slow path protected by reference counting. */
> > > > >   return do_something_else_with((struct foo *)p);  /* CHANGE */
> > > > >   }
> > > > >   /* Needed slow path, but raced with deletion. */
> > > > >   return -EAGAIN;

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote:
> 
> On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > > 
> > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > > > +   code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > > > +   which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int flip_index = 0;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > > > +   are often compiled using branches.  And as before, although
> > > > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > +   after such branches, they can speculate loads, which can again
> > > > > > > > +   result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > (unless
> > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > values that
> > > > > > > flip_index can have).
> > > > > > 
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > > 
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > dependency to be tracked through a comparison.)
> > > > > 
> > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > comparison?
> > > > 
> > > > That should work well assuming that things like "if", "while", and "?:"
> > > > conditions are happy to take a vdp.
> > > 
> > > I currently don't see a reason why that should be disallowed.  If we
> > > have allowed an implicit conversion to non-vdp, I believe that should
> > > follow.
> > 
> > I am a bit nervous about a silent implicit conversion from vdp to
> > non-vdp in the general case.
> 
> Why are you nervous about it?

If someone expects the vdp to propagate into some function that might
be compiled with aggressive optimizations that break this expectation,
it would be good for that someone to know about it.

Ah!  I am assuming that the compiler is -not- emitting memory barriers
at vdp-to-non-vdp transitions.  In that case, warnings are even more
important -- without the warnings, it is a real pain chasing these
unnecessary memory barriers out of the code.

So we are -not- in the business of emitting memory barriers on
vdp-to-non-vdp transitions, right?

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > > > +   code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > > > +   which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int flip_index = 0;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > > > +   are often compiled using branches.  And as before, although
> > > > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > +   after such branches, they can speculate loads, which can again
> > > > > > > > +   result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > (unless
> > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > values that
> > > > > > > flip_index can have).
> > > > > > 
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > > 
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > dependency to be tracked through a comparison.)
> > > > > 
> > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > comparison?
> > > > 
> > > > That should work well assuming that things like "if", "while", and "?:"
> > > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > through the program.  ;-)
> > > > 
> > > > The other thing that can happen is that a vdp can get handed off to
> > > > another synchronization mechanism, for example, to reference counting:
> > > > 
> > > > p = atomic_load_explicit(&gp, memory_order_consume);
> > > > if (do_something_with(p->a)) {
> > > > /* fast path protected by RCU. */
> > > > return 0;
> > > > }
> > > > if (atomic_inc_not_zero(&p->refcnt)) {
> > > > /* slow path protected by reference counting. */
> > > > return do_something_else_with((struct foo *)p);  /* CHANGE */
> > > > }
> > > > /* Needed slow path, but raced with deletion. */
> > > > return -EAGAIN;
> > > > 
> > > > I am guessing that the cast ends the vdp.  Is that the case?
> > > 
> > > And here is a more elaborate example from the Linux kernel:
> > > 
> > > 	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> > > 
> > > 	rdev = rcu_dereference(conf->mirrors[disk].rdev);
> > > 	if (r1_bio->bios[disk] == IO_BLOCKED
> > > 	    || rdev == NULL
> > > 	    || test_bit(Unmerged, &rdev->flags)
> > > 	    || test_bit(Faulty, &rdev->flags))
> > > 		continue;
> > > 
> > > The fact that the rdev == NULL returns vdp does not force the ||
> > > operators to be evaluated arithmetically because the entire function
> > > is an if condition, correct?
> > 
> > That's a good question, and one that as far as I understand 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > > +   code is buggy:
> > > > > > > +
> > > > > > > +   int a[2];
> > > > > > > +   int index;
> > > > > > > +   int force_zero_index = 1;
> > > > > > > +
> > > > > > > +   ...
> > > > > > > +
> > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > > +   which can result in misordering bugs.
> > > > > > > +
> > > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > +   int a[2];
> > > > > > > +   int index;
> > > > > > > +   int flip_index = 0;
> > > > > > > +
> > > > > > > +   ...
> > > > > > > +
> > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > > +   are often compiled using branches.  And as before, although
> > > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > +   after such branches, they can speculate loads, which can again
> > > > > > > +   result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > (unless
> > > > > > there are further constraints due to the type of r1 and the values 
> > > > > > that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?
> > > 
> > > That should work well assuming that things like "if", "while", and "?:"
> > > conditions are happy to take a vdp.
> > 
> > I currently don't see a reason why that should be disallowed.  If we
> > have allowed an implicit conversion to non-vdp, I believe that should
> > follow.
> 
> I am a bit nervous about a silent implicit conversion from vdp to
> non-vdp in the general case.

Why are you nervous about it?

> However, when the result is being used by
> a conditional, the silent implicit conversion makes a lot of sense.
> Is that distinction something that the compiler can handle easily?

I think so. I'm not a language lawyer, but we have other such
conversions in the standard (e.g., int to boolean, between int and
float) and I currently don't see a fundamental difference to those.  But
we'll have to ask the language folks (or SG1 or LEWG) to really verify
that.

> On the other hand, silent implicit conversion from non-vdp to vdp
> is very useful for common code that can be invoked both by RCU
> readers and by updaters.

I'd be more nervous about that, because then there are fewer obstacles
to one programmer expecting a vdp to indicate a dependency while another
programmer puts non-vdp values into a vdp.

For this case of common code (which I agree is a valid concern), would
it be a lot of programmer overhead to add explicit casts from non-vdp to
vdp?  Would C11 generics help with that, similarly to how C++ template
functions would?

Nonetheless, in the end this is just trading off convenient use against
different ways to catch different but simple errors.

> > ?: could be somewhat special, in that the type depends on the
> > 2nd and 3rd operand.  Thus, vdp x = non-vdp ? vdp : vdp; should be
> > allowed, whereas vdp x = non-vdp ? non-vdp : vdp; probably should be
> > disallowed if we don't provide for implicit casts from non-vdp to vdp.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
 On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
  xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com
  X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC)
  
  On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)

On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
  xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
  X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
  
  On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
   +oDo not use the results from the boolean  and || 
   when
   + dereferencing.  For example, the following (rather improbable)
   + code is buggy:
   +
   + int a[2];
   + int index;
   + int force_zero_index = 1;
   +
   + ...
   +
   + r1 = rcu_dereference(i1)
   + r2 = a[r1  force_zero_index];  /* BUGGY!!! */
   +
   + The reason this is buggy is that  and || are often 
   compiled
   + using branches.  While weak-memory machines such as ARM or 
   PowerPC
   + do order stores after such branches, they can speculate loads,
   + which can result in misordering bugs.
   +
   +oDo not use the results from relational operators (==, 
   !=,
   + , =, , or =) when dereferencing.  For example,
   + the following (quite strange) code is buggy:
   +
   + int a[2];
   + int index;
   + int flip_index = 0;
   +
   + ...
   +
   + r1 = rcu_dereference(i1)
   + r2 = a[r1 != flip_index];  /* BUGGY!!! */
   +
   + As before, the reason this is buggy is that relational operators
   + are often compiled using branches.  And as before, although
   + weak-memory machines such as ARM or PowerPC do order stores
   + after such branches, but can speculate loads, which can again
   + result in misordering bugs.
  
  Those two would be allowed by the wording I have recently proposed,
  AFAICS.  r1 != flip_index would result in two possible values 
  (unless
  there are further constraints due to the type of r1 and the values 
  that
  flip_index can have).
 
 And I am OK with the value_dep_preserving type providing more/better
 guarantees than we get by default from current compilers.
 
 One question, though.  Suppose that the code did not want a value
 dependency to be tracked through a comparison operator.  What does
 the developer do in that case?  (The reason I ask is that I have
 not yet found a use case in the Linux kernel that expects a value
 dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?
   
   That should work well assuming that things like if, while, and ?:
   conditions are happy to take a vdp.
  
  I currently don't see a reason why that should be disallowed.  If we
  have allowed an implicit conversion to non-vdp, I believe that should
  follow.
 
 I am a bit nervous about a silent implicit conversion from vdp to
 non-vdp in the general case.

Why are you nervous about it?

 However, when the result is being used by
 a conditional, the silent implicit conversion makes a lot of sense.
 Is that distinction something that the compiler can handle easily?

I think so. I'm not a language lawyer, but we have other such
conversions in the standard (e.g., int to boolean, between int and
float) and I currently don't see a fundamental difference to those.  But
we'll have to ask the language folks (or SG1 or LEWG) to really verify
that.

 On the other hand, silent implicit conversion from non-vdp to vdp
 is very useful for common code that can be invoked both by RCU
 readers and by updaters.

I'd be more nervous about that because then there's less obstacles to
one programmer expecting a vdp to indicate a dependency vs. another
programmer putting non-vdp into vdp.

For this case of common code (which I agree is a valid concern), would
it be a lot of programmer overhead to add explicit casts from non-vdp to
vdp?  Would C11 generics help with that, similarly to how C++ template
functions would?

Nonetheless, in the end this is just trading off convenient use against
different ways to catch different but simple errors.

   ?: could be somewhat special, in that the type depends on the
  2nd and 3rd operand.  Thus, vdp x = non-vdp ? vdp : vdp; should be
  allowed, whereas vdp x = non-vdp ? non-vdp : vdp; probably should be
  disallowed if we don't provide for implicit casts from non-vdp to vdp.
 
 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
 On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
  On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
   On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
 xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
 X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
 
 On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
   xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
   X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
   
   On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
+o  Do not use the results from the boolean  and || 
when
+   dereferencing.  For example, the following (rather 
improbable)
+   code is buggy:
+
+   int a[2];
+   int index;
+   int force_zero_index = 1;
+
+   ...
+
+   r1 = rcu_dereference(i1)
+   r2 = a[r1  force_zero_index];  /* BUGGY!!! */
+
+   The reason this is buggy is that  and || are 
often compiled
+   using branches.  While weak-memory machines such as ARM 
or PowerPC
+   do order stores after such branches, they can speculate 
loads,
+   which can result in misordering bugs.
+
+o  Do not use the results from relational operators (==, 
!=,
+   , =, , or =) when dereferencing.  For 
example,
+   the following (quite strange) code is buggy:
+
+   int a[2];
+   int index;
+   int flip_index = 0;
+
+   ...
+
+   r1 = rcu_dereference(i1)
+   r2 = a[r1 != flip_index];  /* BUGGY!!! */
+
+   As before, the reason this is buggy is that relational 
operators
+   are often compiled using branches.  And as before, 
although
+   weak-memory machines such as ARM or PowerPC do order 
stores
+   after such branches, but can speculate loads, which can 
again
+   result in misordering bugs.
   
   Those two would be allowed by the wording I have recently 
   proposed,
   AFAICS.  r1 != flip_index would result in two possible values 
   (unless
   there are further constraints due to the type of r1 and the 
   values that
   flip_index can have).
  
  And I am OK with the value_dep_preserving type providing more/better
  guarantees than we get by default from current compilers.
  
  One question, though.  Suppose that the code did not want a value
  dependency to be tracked through a comparison operator.  What does
  the developer do in that case?  (The reason I ask is that I have
  not yet found a use case in the Linux kernel that expects a value
  dependency to be tracked through a comparison.)
 
 Hmm.  I suppose use an explicit cast to non-vdp before or after the
 comparison?

That should work well assuming that things like if, while, and ?:
conditions are happy to take a vdp.  This assumes that p-a only returns
vdp if field a is declared vdp, otherwise we have vdps running wild
through the program.  ;-)

The other thing that can happen is that a vdp can get handed off to
another synchronization mechanism, for example, to reference counting:

p = atomic_load_explicit(gp, memory_order_consume);
if (do_something_with(p-a)) {
/* fast path protected by RCU. */
return 0;
}
if (atomic_inc_not_zero(p-refcnt) {
/* slow path protected by reference counting. */
return do_something_else_with((struct foo *)p);  /* 
CHANGE */
}
/* Needed slow path, but raced with deletion. */
return -EAGAIN;

I am guessing that the cast ends the vdp.  Is that the case?
   
   And here is a more elaborate example from the Linux kernel:
   
 struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
   
 rdev = rcu_dereference(conf-mirrors[disk].rdev);
 if (r1_bio-bios[disk] == IO_BLOCKED
 || rdev == NULL
 || test_bit(Unmerged, rdev-flags)
 || test_bit(Faulty, rdev-flags))
 continue;
   
   The fact that the rdev == NULL returns vdp does not force the ||
   operators to be evaluated arithmetically because the entire function
   is an if condition, correct?
  
  That's a good question, and one that as far as I understand 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote:
 xagsmtp5.20140307174618.3...@vmsdvm6.vnet.ibm.com
 X-Xagent-Gateway: vmsdvm6.vnet.ibm.com (XAGSMTP5 at VMSDVM6)
 
 On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
  On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
   xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com
   X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC)
   
   On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
 xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
 X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
 
 On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
   xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
   X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
   
   On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
+o  Do not use the results from the boolean  and || 
when
+   dereferencing.  For example, the following (rather 
improbable)
+   code is buggy:
+
+   int a[2];
+   int index;
+   int force_zero_index = 1;
+
+   ...
+
+   r1 = rcu_dereference(i1)
+   r2 = a[r1  force_zero_index];  /* BUGGY!!! */
+
+   The reason this is buggy is that  and || are 
often compiled
+   using branches.  While weak-memory machines such as ARM 
or PowerPC
+   do order stores after such branches, they can speculate 
loads,
+   which can result in misordering bugs.
+
+o  Do not use the results from relational operators (==, 
!=,
+   , =, , or =) when dereferencing.  For 
example,
+   the following (quite strange) code is buggy:
+
+   int a[2];
+   int index;
+   int flip_index = 0;
+
+   ...
+
+   r1 = rcu_dereference(i1)
+   r2 = a[r1 != flip_index];  /* BUGGY!!! */
+
+   As before, the reason this is buggy is that relational 
operators
+   are often compiled using branches.  And as before, 
although
+   weak-memory machines such as ARM or PowerPC do order 
stores
+   after such branches, but can speculate loads, which can 
again
+   result in misordering bugs.
   
   Those two would be allowed by the wording I have recently 
   proposed,
   AFAICS.  r1 != flip_index would result in two possible values 
   (unless
   there are further constraints due to the type of r1 and the 
   values that
   flip_index can have).
  
  And I am OK with the value_dep_preserving type providing more/better
  guarantees than we get by default from current compilers.
  
  One question, though.  Suppose that the code did not want a value
  dependency to be tracked through a comparison operator.  What does
  the developer do in that case?  (The reason I ask is that I have
  not yet found a use case in the Linux kernel that expects a value
  dependency to be tracked through a comparison.)
 
 Hmm.  I suppose use an explicit cast to non-vdp before or after the
 comparison?

That should work well assuming that things like if, while, and ?:
conditions are happy to take a vdp.
   
   I currently don't see a reason why that should be disallowed.  If we
   have allowed an implicit conversion to non-vdp, I believe that should
   follow.
  
  I am a bit nervous about a silent implicit conversion from vdp to
  non-vdp in the general case.
 
 Why are you nervous about it?

If someone expects the vdp to propagate into some function that might
be compiled with aggressive optimizations that break this expectation,
it would be good for that someone to know about it.

Ah!  I am assuming that the compiler is -not- emitting memory barriers
at vdp-to-non-vdp transitions.  In that case, warnings are even more
important -- without the warnings, it is a real pain chasing these
unnecessary memory barriers out of the code.

So we are -not- in the business of emitting memory barriers on
vdp-to-non-vdp transitions, right?

  However, when the result is being used by
  a conditional, the silent implicit conversion makes a lot of sense.
  Is that distinction something that the compiler can handle easily?
 
 I think so. I'm not a language lawyer, but we have other such
 conversions in the standard (e.g., int to boolean, between int and
 float) and I currently don't see a fundamental difference to those.  But
 we'll have to ask the 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote:
 On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
  On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
   On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
  xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
  X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
  
  On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)

On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
 +o  Do not use the results from the boolean "&&" and "||" when
 + dereferencing.  For example, the following (rather improbable)
 + code is buggy:
 +
 + int a[2];
 + int index;
 + int force_zero_index = 1;
 +
 + ...
 +
 + r1 = rcu_dereference(i1)
 + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
 +
 + The reason this is buggy is that "&&" and "||" are often compiled
 + using branches.  While weak-memory machines such as ARM or PowerPC
 + do order stores after such branches, they can speculate loads,
 + which can result in misordering bugs.
 +
 +o  Do not use the results from relational operators ("==", "!=",
 + ">", ">=", "<", or "<=") when dereferencing.  For example,
 + the following (quite strange) code is buggy:
 +
 + int a[2];
 + int index;
 + int flip_index = 0;
 +
 + ...
 +
 + r1 = rcu_dereference(i1)
 + r2 = a[r1 != flip_index];  /* BUGGY!!! */
 +
 + As before, the reason this is buggy is that relational operators
 + are often compiled using branches.  And as before, although
 + weak-memory machines such as ARM or PowerPC do order stores
 + after such branches, they can speculate loads, which can again
 + result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).
   
   And I am OK with the value_dep_preserving type providing more/better
   guarantees than we get by default from current compilers.
   
   One question, though.  Suppose that the code did not want a value
   dependency to be tracked through a comparison operator.  What does
   the developer do in that case?  (The reason I ask is that I have
   not yet found a use case in the Linux kernel that expects a value
   dependency to be tracked through a comparison.)
  
  Hmm.  I suppose use an explicit cast to non-vdp before or after the
  comparison?
 
 That should work well assuming that things like "if", "while", and "?:"
 conditions are happy to take a vdp.  This assumes that p->a only returns
 vdp if field "a" is declared vdp, otherwise we have vdps running wild
 through the program.  ;-)
 
 The other thing that can happen is that a vdp can get handed off to
 another synchronization mechanism, for example, to reference counting:
 
   p = atomic_load_explicit(&gp, memory_order_consume);
   if (do_something_with(p->a)) {
   /* fast path protected by RCU. */
   return 0;
   }
   if (atomic_inc_not_zero(&p->refcnt)) {
   /* slow path protected by reference counting. */
   return do_something_else_with((struct foo *)p);  /* CHANGE */
   }
   /* Needed slow path, but raced with deletion. */
   return -EAGAIN;
 
 I am guessing that the cast ends the vdp.  Is that the case?

And here is a more elaborate example from the Linux kernel:

struct md_rdev value_dep_preserving *rdev;  /* CHANGE */

rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
|| rdev == NULL
|| test_bit(Unmerged, &rdev->flags)
|| test_bit(Faulty, &rdev->flags))
continue;

The fact that the "rdev == NULL" returns vdp does not force the "||"

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Peter Sewell
On 5 March 2014 17:15, Torvald Riegel  wrote:
> On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
>> On 3 March 2014 20:44, Torvald Riegel  wrote:
>> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
>> >> On 1 March 2014 08:03, Paul E. McKenney  
>> >> wrote:
>> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> >> Hi Paul,
>> >> >>
>> >> >> On 28 February 2014 18:50, Paul E. McKenney 
>> >> >>  wrote:
>> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >> >  wrote:
>> >> >> >> > >
>> >> >> >> > > 3.  The comparison was against another RCU-protected 
>> >> >> >> > > pointer,
>> >> >> >> > > where that other pointer was properly fetched using one
>> >> >> >> > > of the RCU primitives.  Here it doesn't matter which 
>> >> >> >> > > pointer
>> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() 
>> >> >> >> > > for
>> >> >> >> > > that other pointer happened after the last update to the
>> >> >> >> > > pointed-to structure.
>> >> >> >> > >
>> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >> >
>> >> >> >> > I think that it might be worth pointing out as an example, and 
>> >> >> >> > saying
>> >> >> >> > that code like
>> >> >> >> >
>> >> >> >> >p = atomic_read(consume);
>> >> >> >> >X;
>> >> >> >> >q = atomic_read(consume);
>> >> >> >> >Y;
>> >> >> >> >if (p == q)
>> >> >> >> > data = p->val;
>> >> >> >> >
>> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> >> > *either* p or q, but you can't really tell which, since the 
>> >> >> >> > compiler
>> >> >> >> > can decide that the values are interchangeable.
>> >> >> >> >
>> >> >> >> > I cannot for the life of me come up with a situation where this 
>> >> >> >> > would
>> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether 
>> >> >> >> > the
>> >> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> >> > irrelevant. No?
>> >> >> >>
>> >> >> >> I can make a contrived litmus test for it, but you are right, the 
>> >> >> >> only
>> >> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU 
>> >> >> >> can
>> >> >> >> reorder the loads into p and q, and the read from p->val can, as 
>> >> >> >> you say,
>> >> >> >> come from either pointer.
>> >> >> >>
>> >> >> >> For whatever it is worth, hear is the litmus test:
>> >> >> >>
>> >> >> >> T1:   p = kmalloc(...);
>> >> >> >>   if (p == NULL)
>> >> >> >>   deal_with_it();
>> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
>> >> >> >>   p->b = 43;
>> >> >> >>   p->c = 44;
>> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >> >>   p->b = 143;
>> >> >> >>   p->c = 144;
>> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >> >>
>> >> >> >> T2:   p = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >> >> >>   q = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >> >>   if (p == q) {
>> >> >> >>   /* The compiler decides that q->c is same as p->c. */
>> >> >> >>   r2 = p->c; /* Could get 44 on weakly order system. */
>> >> >> >>   }
>> >> >> >>
>> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get 
>> >> >> >> what
>> >> >> >> you get.
>> >> >> >>
>> >> >> >> And publishing a structure via one RCU-protected pointer, updating 
>> >> >> >> it,
>> >> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> >> trouble anyway.  If you really want to do something like that and 
>> >> >> >> still
>> >> >> >> see consistency across all the fields in the structure, please put 
>> >> >> >> a lock
>> >> >> >> in the structure and use it to guard updates and accesses to those 
>> >> >> >> fields.
>> >> >> >
>> >> >> > And here is a patch documenting the restrictions for the current 
>> >> >> > Linux
>> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> >> > differently than atomic_load_explicit(, memory_order_consume).
>> >> >> >
>> >> >> > Thoughts?
>> >> >>
>> >> >> That might serve as informal documentation for linux kernel
>> >> >> programmers about the bounds on the optimisations that you expect
>> >> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> >> intend it to be for.   But I don't see how 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
> > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
> > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +oDo not use the results from the boolean "&&" and "||" when
> > > > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > > > + code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int force_zero_index = 1;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1)
> > > > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > + using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > + do order stores after such branches, they can speculate loads,
> > > > > > > + which can result in misordering bugs.
> > > > > > > +
> > > > > > > +oDo not use the results from relational operators ("==", "!=",
> > > > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > + the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int flip_index = 0;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1)
> > > > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + As before, the reason this is buggy is that relational operators
> > > > > > > + are often compiled using branches.  And as before, although
> > > > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > + after such branches, they can speculate loads, which can again
> > > > > > > + result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?
> > > 
> > > That should work well assuming that things like "if", "while", and "?:"
> > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > through the program.  ;-)
> > > 
> > > The other thing that can happen is that a vdp can get handed off to
> > > another synchronization mechanism, for example, to reference counting:
> > > 
> > >   p = atomic_load_explicit(&gp, memory_order_consume);
> > >   if (do_something_with(p->a)) {
> > >   /* fast path protected by RCU. */
> > >   return 0;
> > >   }
> > >   if (atomic_inc_not_zero(&p->refcnt)) {
> > >   /* slow path protected by reference counting. */
> > >   return do_something_else_with((struct foo *)p);  /* CHANGE */
> > >   }
> > >   /* Needed slow path, but raced with deletion. */
> > >   return -EAGAIN;
> > > 
> > > I am guessing that the cast ends the vdp.  Is that the case?
> > 
> > And here is a more elaborate example from the Linux kernel:
> > 
> > struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> > 
> > rdev = rcu_dereference(conf->mirrors[disk].rdev);
> > if (r1_bio->bios[disk] == IO_BLOCKED
> > || rdev == NULL
> > || test_bit(Unmerged, &rdev->flags)
> > || test_bit(Faulty, &rdev->flags))
> > continue;
> > 
> > The fact that the "rdev == NULL" returns vdp does not force the "||"
> > operators to be evaluated arithmetically because the 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com
> X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC)
> 
> On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
> > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
> > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > +   code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int force_zero_index = 1;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > +   which can result in misordering bugs.
> > > > > > +
> > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +   the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int flip_index = 0;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > +   are often compiled using branches.  And as before, although
> > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +   after such branches, they can speculate loads, which can again
> > > > > > +   result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.
> 
> I currently don't see a reason why that should be disallowed.  If we
> have allowed an implicit conversion to non-vdp, I believe that should
> follow.

I am a bit nervous about a silent implicit conversion from vdp to
non-vdp in the general case.  However, when the result is being used by
a conditional, the silent implicit conversion makes a lot of sense.
Is that distinction something that the compiler can handle easily?

On the other hand, silent implicit conversion from non-vdp to vdp
is very useful for common code that can be invoked both by RCU
readers and by updaters.

>  ?: could be somewhat special, in that the type depends on the
> 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
> allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
> disallowed if we don't provide for implicit casts from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be able
to silently convert from non-vdp to vdp in order to permit common code
that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
This common code must be compiled conservatively to allow vdp, but should
be just fine with non-vdp.

Going through the combinations...

 0. vdp x = vdp ? vdp : vdp; /* OK, matches. */
 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */
 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */
 3. vdp 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
> On 3 March 2014 20:44, Torvald Riegel  wrote:
> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney 
> >> >>  wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> >  wrote:
> >> >> >> > >
> >> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> >> > > where that other pointer was properly fetched using one
> >> >> >> > > of the RCU primitives.  Here it doesn't matter which 
> >> >> >> > > pointer
> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() 
> >> >> >> > > for
> >> >> >> > > that other pointer happened after the last update to the
> >> >> >> > > pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and 
> >> >> >> > saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >p = atomic_read(consume);
> >> >> >> >X;
> >> >> >> >q = atomic_read(consume);
> >> >> >> >Y;
> >> >> >> >if (p == q)
> >> >> >> > data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the 
> >> >> >> > compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this 
> >> >> >> > would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the 
> >> >> >> only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU 
> >> >> >> can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> >> say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> >> For whatever it is worth, hear is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>   if (p == NULL)
> >> >> >>   deal_with_it();
> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >> >>   p->b = 43;
> >> >> >>   p->c = 44;
> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>   p->b = 143;
> >> >> >>   p->c = 144;
> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>   q = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>   if (p == q) {
> >> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >> >>   r2 = p->c; /* Could get 44 on weakly order system. */
> >> >> >>   }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get 
> >> >> >> what
> >> >> >> you get.
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating 
> >> >> >> it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and 
> >> >> >> still
> >> >> >> see consistency across all the fields in the structure, please put a 
> >> >> >> lock
> >> >> >> in the structure and use it to guard updates and accesses to those 
> >> >> >> fields.
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is what
> >> >> you really need.  As 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
> > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
> > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > +   code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int force_zero_index = 1;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > +   which can result in misordering bugs.
> > > > > > +
> > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +   the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int flip_index = 0;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > +   are often compiled using branches.  And as before, although
> > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +   after such branches, they can speculate loads, which can again
> > > > > > +   result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.  This assumes that p->a only returns
> > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > through the program.  ;-)
> > 
> > The other thing that can happen is that a vdp can get handed off to
> > another synchronization mechanism, for example, to reference counting:
> > 
> > p = atomic_load_explicit(&gp, memory_order_consume);
> > if (do_something_with(p->a)) {
> > /* fast path protected by RCU. */
> > return 0;
> > }
> > if (atomic_inc_not_zero(&p->refcnt)) {
> > /* slow path protected by reference counting. */
> > return do_something_else_with((struct foo *)p);  /* CHANGE */
> > }
> > /* Needed slow path, but raced with deletion. */
> > return -EAGAIN;
> > 
> > I am guessing that the cast ends the vdp.  Is that the case?
> 
> And here is a more elaborate example from the Linux kernel:
> 
>   struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> 
>   rdev = rcu_dereference(conf->mirrors[disk].rdev);
>   if (r1_bio->bios[disk] == IO_BLOCKED
>   || rdev == NULL
> >   || test_bit(Unmerged, &rdev->flags)
> >   || test_bit(Faulty, &rdev->flags))
>   continue;
> 
> The fact that the "rdev == NULL" returns vdp does not force the "||"
> operators to be evaluated arithmetically because the entire function
> is an "if" condition, correct?

That's a good question, and one that as far as I understand currently,
essentially boils down to whether we want to have tight restrictions on
which operations are still vdp.

If we look at 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com
> > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA)
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com
> > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC)
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +oDo not use the results from the boolean "&&" and "||" when
> > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > + code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int force_zero_index = 1;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > + The reason this is buggy is that "&&" and "||" are often compiled
> > > > > + using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > + do order stores after such branches, they can speculate loads,
> > > > > + which can result in misordering bugs.
> > > > > +
> > > > > +oDo not use the results from relational operators ("==", "!=",
> > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > + the following (quite strange) code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int flip_index = 0;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > + As before, the reason this is buggy is that relational operators
> > > > > + are often compiled using branches.  And as before, although
> > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > + after such branches, they can speculate loads, which can again
> > > > > + result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.

I currently don't see a reason why that should be disallowed.  If we
have allowed an implicit conversion to non-vdp, I believe that should
follow.  ?: could be somewhat special, in that the type depends on the
2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
disallowed if we don't provide for implicit casts from non-vdp to vdp.

> This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)

That's a good question.  For the scheme I had in mind, I'm not concerned
about vdps running wild because one needs to assign to explicitly
vdp-typed variables (or function arguments, etc.) to let vdp extend to
beyond single expressions.

Nonetheless, I think it's a good question how -> should behave if the
field is not vdp; in particular, should vdp->non_vdp be automatically
vdp?  One concern might be that we know something about non-vdp -- OTOH,
we shouldn't be able to do so because we (assume to) don't know anything
about the vdp pointer, so we can't infer something about something it
points to.

> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
> 
>   p = atomic_load_explicit(gp, memory_order_consume);
>   if (do_something_with(p->a)) {
>   /* fast path protected by RCU. */
>   return 0;
>   }
>   if (atomic_inc_not_zero(&p->refcnt)) {

Is the argument to atomic_inc_not_zero vdp or non-vdp?

>   /* slow path protected by reference counting. */
>   return do_something_else_with((struct foo *)p);  /* CHANGE */

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
  
  On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:

On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
 +o  Do not use the results from the boolean "&&" and "||" when
 + dereferencing.  For example, the following (rather improbable)
 + code is buggy:
 +
 + int a[2];
 + int index;
 + int force_zero_index = 1;
 +
 + ...
 +
 + r1 = rcu_dereference(i1);
 + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
 +
 + The reason this is buggy is that "&&" and "||" are often compiled
 + using branches.  While weak-memory machines such as ARM or PowerPC
 + do order stores after such branches, they can speculate loads,
 + which can result in misordering bugs.
 +
 +o  Do not use the results from relational operators ("==", "!=",
 + ">", ">=", "<", or "<=") when dereferencing.  For example,
 + the following (quite strange) code is buggy:
 +
 + int a[2];
 + int index;
 + int flip_index = 0;
 +
 + ...
 +
 + r1 = rcu_dereference(i1);
 + r2 = a[r1 != flip_index];  /* BUGGY!!! */
 +
 + As before, the reason this is buggy is that relational operators
 + are often compiled using branches.  And as before, although
 + weak-memory machines such as ARM or PowerPC do order stores
 + after such branches, they can speculate loads, which can again
 + result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).
   
   And I am OK with the value_dep_preserving type providing more/better
   guarantees than we get by default from current compilers.
   
   One question, though.  Suppose that the code did not want a value
   dependency to be tracked through a comparison operator.  What does
   the developer do in that case?  (The reason I ask is that I have
   not yet found a use case in the Linux kernel that expects a value
   dependency to be tracked through a comparison.)
  
  Hmm.  I suppose use an explicit cast to non-vdp before or after the
  comparison?
 
 That should work well assuming that things like "if", "while", and "?:"
 conditions are happy to take a vdp.

I currently don't see a reason why that should be disallowed.  If we
have allowed an implicit conversion to non-vdp, I believe that should
follow.  ?: could be somewhat special, in that the type depends on the
2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
disallowed if we don't provide for implicit casts from non-vdp to vdp.

 This assumes that p->a only returns
 vdp if field "a" is declared vdp, otherwise we have vdps running wild
 through the program.  ;-)

That's a good question.  For the scheme I had in mind, I'm not concerned
about vdps running wild because one needs to assign to explicitly
vdp-typed variables (or function arguments, etc.) to let vdp extend to
beyond single expressions.

Nonetheless, I think it's a good question how "->" should behave if the
field is not vdp; in particular, should vdp->non_vdp be automatically
vdp?  One concern might be that we know something about non-vdp -- OTOH,
we shouldn't be able to do so because we (assume to) don't know anything
about the vdp pointer, so we can't infer something about something it
points to.

 The other thing that can happen is that a vdp can get handed off to
 another synchronization mechanism, for example, to reference counting:
 
   p = atomic_load_explicit(gp, memory_order_consume);
   if (do_something_with(p->a)) {
   /* fast path protected by RCU. */
   return 0;
   }
   if (atomic_inc_not_zero(&p->refcnt)) {

Is the argument to atomic_inc_not_zero vdp or non-vdp?

   /* slow path protected by reference counting. */
   return do_something_else_with((struct foo *)p);  /* CHANGE */
   }
   /* Needed slow path, but raced with deletion. */
   return -EAGAIN;
 
 I am guessing that the cast ends the vdp.  Is that the case?

That would end it, yes.  The other way this could happen is that the
argument of do_something_else_with() would be specified to be non-vdp.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
 On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
   
   On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
 
 On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
  +o  Do not use the results from the boolean "&&" and "||" when
  +   dereferencing.  For example, the following (rather improbable)
  +   code is buggy:
  +
  +   int a[2];
  +   int index;
  +   int force_zero_index = 1;
  +
  +   ...
  +
  +   r1 = rcu_dereference(i1);
  +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
  +
  +   The reason this is buggy is that "&&" and "||" are often compiled
  +   using branches.  While weak-memory machines such as ARM or PowerPC
  +   do order stores after such branches, they can speculate loads,
  +   which can result in misordering bugs.
  +
  +o  Do not use the results from relational operators ("==", "!=",
  +   ">", ">=", "<", or "<=") when dereferencing.  For example,
  +   the following (quite strange) code is buggy:
  +
  +   int a[2];
  +   int index;
  +   int flip_index = 0;
  +
  +   ...
  +
  +   r1 = rcu_dereference(i1);
  +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
  +
  +   As before, the reason this is buggy is that relational operators
  +   are often compiled using branches.  And as before, although
  +   weak-memory machines such as ARM or PowerPC do order stores
  +   after such branches, they can speculate loads, which can again
  +   result in misordering bugs.
 
 Those two would be allowed by the wording I have recently proposed,
 AFAICS.  r1 != flip_index would result in two possible values (unless
 there are further constraints due to the type of r1 and the values 
 that
 flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)
   
   Hmm.  I suppose use an explicit cast to non-vdp before or after the
   comparison?
  
  That should work well assuming that things like "if", "while", and "?:"
  conditions are happy to take a vdp.  This assumes that p->a only returns
  vdp if field "a" is declared vdp, otherwise we have vdps running wild
  through the program.  ;-)
  
  The other thing that can happen is that a vdp can get handed off to
  another synchronization mechanism, for example, to reference counting:
  
  p = atomic_load_explicit(gp, memory_order_consume);
  if (do_something_with(p->a)) {
  /* fast path protected by RCU. */
  return 0;
  }
  if (atomic_inc_not_zero(&p->refcnt)) {
  /* slow path protected by reference counting. */
  return do_something_else_with((struct foo *)p);  /* CHANGE */
  }
  /* Needed slow path, but raced with deletion. */
  return -EAGAIN;
  
  I am guessing that the cast ends the vdp.  Is that the case?
 
 And here is a more elaborate example from the Linux kernel:
 
   struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
 
  rdev = rcu_dereference(conf->mirrors[disk].rdev);
  if (r1_bio->bios[disk] == IO_BLOCKED
  || rdev == NULL
  || test_bit(Unmerged, &rdev->flags)
  || test_bit(Faulty, &rdev->flags))
   continue;
 
 The fact that the rdev == NULL returns vdp does not force the ||
 operators to be evaluated arithmetically because the entire function
 is an if condition, correct?

That's a good question, and one that as far as I understand currently,
essentially boils down to whether we want to have tight restrictions on
which operations are still vdp.

If we look at the different combinations, then it seems we can't decide
on whether we have a value-dependency just due to a vdp type:
* non-vdp || vdp:  vdp iff non-vdp == false
* vdp || non-vdp:  vdp iff non-vdp == false?
* vdp || vdp: always vdp? (and dependency on both?)

I'm not sure it makes sense to try to not make all of those
vdp-by-default.  The first and second case show that it's dependent on
the specific execution anyway, and thus is 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
 On 3 March 2014 20:44, Torvald Riegel trie...@redhat.com wrote:
  On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
  On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
   On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
   Hi Paul,
  
   On 28 February 2014 18:50, Paul E. McKenney 
   paul...@linux.vnet.ibm.com wrote:
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  3.  The comparison was against another RCU-protected pointer,
  where that other pointer was properly fetched using one
  of the RCU primitives.  Here it doesn't matter which 
  pointer
  you use.  At least as long as the rcu_assign_pointer() 
  for
  that other pointer happened after the last update to the
  pointed-to structure.
 
  I am a bit nervous about #3.  Any thoughts on it?

 I think that it might be worth pointing out as an example, and 
 saying
 that code like

p = atomic_read(consume);
X;
q = atomic_read(consume);
Y;
if (p == q)
 data = p->val;

 then the access of p->val is constrained to be data-dependent on
 *either* p or q, but you can't really tell which, since the 
 compiler
 can decide that the values are interchangeable.

 I cannot for the life of me come up with a situation where this 
 would
 matter, though. If X contains a fence, then that fence will be a
 stronger ordering than anything the consume through p would
 guarantee anyway. And if X does *not* contain a fence, then the
 atomic reads of p and q are unordered *anyway*, so then whether the
 ordering to the access through p is through p or q is kind of
 irrelevant. No?
   
I can make a contrived litmus test for it, but you are right, the 
only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU 
can
reorder the loads into p and q, and the read from p->val can, as you 
say,
come from either pointer.
   
For whatever it is worth, here is the litmus test:
   
T1:   p = kmalloc(...);
  if (p == NULL)
  deal_with_it();
  p->a = 42;  /* Each field in its own cache line. */
  p->b = 43;
  p->c = 44;
  atomic_store_explicit(gp1, p, memory_order_release);
  p->b = 143;
  p->c = 144;
  atomic_store_explicit(gp2, p, memory_order_release);
   
T2:   p = atomic_load_explicit(gp2, memory_order_consume);
  r1 = p->b;  /* Guaranteed to get 143. */
  q = atomic_load_explicit(gp1, memory_order_consume);
  if (p == q) {
  /* The compiler decides that q->c is same as p->c. */
  r2 = p->c; /* Could get 44 on weakly ordered system. */
  }
   
The loads from gp1 and gp2 are, as you say, unordered, so you get 
what
you get.
   
And publishing a structure via one RCU-protected pointer, updating 
it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and 
still
see consistency across all the fields in the structure, please put a 
lock
in the structure and use it to guard updates and accesses to those 
fields.
   
And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(p, memory_order_consume).
   
Thoughts?
  
   That might serve as informal documentation for linux kernel
   programmers about the bounds on the optimisations that you expect
   compilers to do for common-case RCU code - and I guess that's what you
   intend it to be for.   But I don't see how one can make it precise
   enough to serve as a language definition, so that compiler people
   could confidently say yes, we respect that, which I guess is what
   you really need.  As a useful criterion, we should aim for something
   precise enough that in a verified-compiler context you can
   mathematically prove that the compiler will satisfy it  (even though
   that won't happen anytime soon for GCC), and that analysis tool
   authors can actually know what they're working with.   All this stuff
   about you should avoid cancellation, and avoid masking with just a
   small number of bits is just too vague.
  
   Understood, and yes, this is intended to document current compiler
   behavior for the Linux kernel community.  It would not make sense to show
   it to the C11 or C++11 communities, except perhaps as an 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
 
 On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
   
   On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
 
 On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
  +o  Do not use the results from the boolean "&&" and "||" when
  +   dereferencing.  For example, the following (rather improbable)
  +   code is buggy:
  +
  +   int a[2];
  +   int index;
  +   int force_zero_index = 1;
  +
  +   ...
  +
  +   r1 = rcu_dereference(i1);
  +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
  +
  +   The reason this is buggy is that "&&" and "||" are often compiled
  +   using branches.  While weak-memory machines such as ARM or PowerPC
  +   do order stores after such branches, they can speculate loads,
  +   which can result in misordering bugs.
  +
  +o  Do not use the results from relational operators ("==", "!=",
  +   ">", ">=", "<", or "<=") when dereferencing.  For example,
  +   the following (quite strange) code is buggy:
  +
  +   int a[2];
  +   int index;
  +   int flip_index = 0;
  +
  +   ...
  +
  +   r1 = rcu_dereference(i1);
  +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
  +
  +   As before, the reason this is buggy is that relational operators
  +   are often compiled using branches.  And as before, although
  +   weak-memory machines such as ARM or PowerPC do order stores
  +   after such branches, they can speculate loads, which can again
  +   result in misordering bugs.
 
 Those two would be allowed by the wording I have recently proposed,
 AFAICS.  r1 != flip_index would result in two possible values (unless
 there are further constraints due to the type of r1 and the values 
 that
 flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)
   
   Hmm.  I suppose use an explicit cast to non-vdp before or after the
   comparison?
  
  That should work well assuming that things like "if", "while", and "?:"
  conditions are happy to take a vdp.
 
 I currently don't see a reason why that should be disallowed.  If we
 have allowed an implicit conversion to non-vdp, I believe that should
 follow.

I am a bit nervous about a silent implicit conversion from vdp to
non-vdp in the general case.  However, when the result is being used by
a conditional, the silent implicit conversion makes a lot of sense.
Is that distinction something that the compiler can handle easily?

On the other hand, silent implicit conversion from non-vdp to vdp
is very useful for common code that can be invoked both by RCU
readers and by updaters.

  ?: could be somewhat special, in that the type depends on the
 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
 allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
 disallowed if we don't provide for implicit casts from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be able
to silently convert from non-vdp to vdp in order to permit common code
that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
This common code must be compiled conservatively to allow vdp, but should
be just fine with non-vdp.

Going through the combinations...

 0. vdp x = vdp ? vdp : vdp; /* OK, matches. */
 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */
 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */
 3. vdp x = vdp ? non-vdp : non-vdp; /* Silent conversion. */
 4. vdp x = non-vdp ? vdp : vdp; /* OK, matches. */
 5. vdp x = non-vdp ? vdp : non-vdp; /* Silent conversion. */
 6. vdp x = non-vdp ? non-vdp : vdp; /* Silent conversion. */
 7. vdp x = non-vdp ? non-vdp : non-vdp; /* Silent conversion. */
 8. non-vdp x = vdp ? vdp : vdp;   

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
 On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
  On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:

On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
  
  On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
    +o  Do not use the results from the boolean "&&" and "||" when
    + dereferencing.  For example, the following (rather improbable)
    + code is buggy:
    +
    + int a[2];
    + int index;
    + int force_zero_index = 1;
    +
    + ...
    +
    + r1 = rcu_dereference(i1);
    + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
    +
    + The reason this is buggy is that "&&" and "||" are often compiled
    + using branches.  While weak-memory machines such as ARM or PowerPC
    + do order stores after such branches, they can speculate loads,
    + which can result in misordering bugs.
   +
    +o  Do not use the results from relational operators ("==", "!=",
    + ">", ">=", "<", or "<=") when dereferencing.  For example,
    + the following (quite strange) code is buggy:
    +
    + int a[2];
    + int index;
    + int flip_index = 0;
    +
    + ...
    +
    + r1 = rcu_dereference(i1);
    + r2 = a[r1 != flip_index];  /* BUGGY!!! */
    +
    + As before, the reason this is buggy is that relational operators
    + are often compiled using branches.  And as before, although
    + weak-memory machines such as ARM or PowerPC do order stores
    + after such branches, they can speculate loads, which can again
    + result in misordering bugs.
  
  Those two would be allowed by the wording I have recently proposed,
  AFAICS.  r1 != flip_index would result in two possible values 
  (unless
  there are further constraints due to the type of r1 and the values 
  that
  flip_index can have).
 
 And I am OK with the value_dep_preserving type providing more/better
 guarantees than we get by default from current compilers.
 
 One question, though.  Suppose that the code did not want a value
 dependency to be tracked through a comparison operator.  What does
 the developer do in that case?  (The reason I ask is that I have
 not yet found a use case in the Linux kernel that expects a value
 dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?
   
   That should work well assuming that things like "if", "while", and "?:"
   conditions are happy to take a vdp.  This assumes that p->a only returns
   vdp if field "a" is declared vdp, otherwise we have vdps running wild
   through the program.  ;-)
   
   The other thing that can happen is that a vdp can get handed off to
   another synchronization mechanism, for example, to reference counting:
   
 p = atomic_load_explicit(gp, memory_order_consume);
 if (do_something_with(p->a)) {
 /* fast path protected by RCU. */
 return 0;
 }
 if (atomic_inc_not_zero(&p->refcnt)) {
 /* slow path protected by reference counting. */
 return do_something_else_with((struct foo *)p);  /* CHANGE */
 }
 /* Needed slow path, but raced with deletion. */
 return -EAGAIN;
   
   I am guessing that the cast ends the vdp.  Is that the case?
  
  And here is a more elaborate example from the Linux kernel:
  
  struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
  
  rdev = rcu_dereference(conf->mirrors[disk].rdev);
  if (r1_bio->bios[disk] == IO_BLOCKED
  || rdev == NULL
  || test_bit(Unmerged, &rdev->flags)
  || test_bit(Faulty, &rdev->flags))
  continue;
  
  The fact that the rdev == NULL returns vdp does not force the ||
  operators to be evaluated arithmetically because the entire function
  is an if condition, correct?
 
 That's a good question, and one that as far as I understand currently,
 essentially boils down to whether we want to have tight restrictions on
 which operations are still vdp.
 
 If we look at the different combinations, then it seems we can't decide
 on whether we have a value-dependency just due to a vdp type:
 * non-vdp || vdp:  vdp iff non-vdp == false
 * vdp || non-vdp:  vdp iff non-vdp == false?
 * vdp || vdp: always vdp? (and dependency on both?)
 
 I'm not sure it makes sense to try to not make 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Peter Sewell
On 5 March 2014 17:15, Torvald Riegel trie...@redhat.com wrote:
 On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
 On 3 March 2014 20:44, Torvald Riegel trie...@redhat.com wrote:
  On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
  On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com 
  wrote:
   On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
   Hi Paul,
  
   On 28 February 2014 18:50, Paul E. McKenney 
   paul...@linux.vnet.ibm.com wrote:
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  3.  The comparison was against another RCU-protected 
  pointer,
  where that other pointer was properly fetched using one
  of the RCU primitives.  Here it doesn't matter which 
  pointer
  you use.  At least as long as the rcu_assign_pointer() 
  for
  that other pointer happened after the last update to the
  pointed-to structure.
 
  I am a bit nervous about #3.  Any thoughts on it?

 I think that it might be worth pointing out as an example, and 
 saying
 that code like

p = atomic_read(consume);
X;
q = atomic_read(consume);
Y;
if (p == q)
 data = p->val;

 then the access of p->val is constrained to be data-dependent on
 *either* p or q, but you can't really tell which, since the 
 compiler
 can decide that the values are interchangeable.

 I cannot for the life of me come up with a situation where this 
 would
 matter, though. If X contains a fence, then that fence will be a
 stronger ordering than anything the consume through p would
 guarantee anyway. And if X does *not* contain a fence, then the
 atomic reads of p and q are unordered *anyway*, so then whether 
 the
 ordering to the access through p is through p or q is kind of
 irrelevant. No?
   
I can make a contrived litmus test for it, but you are right, the 
only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU 
can
reorder the loads into p and q, and the read from p->val can, as 
you say,
come from either pointer.
   
For whatever it is worth, here is the litmus test:
   
T1:   p = kmalloc(...);
  if (p == NULL)
  deal_with_it();
  p->a = 42;  /* Each field in its own cache line. */
  p->b = 43;
  p->c = 44;
  atomic_store_explicit(gp1, p, memory_order_release);
  p->b = 143;
  p->c = 144;
  atomic_store_explicit(gp2, p, memory_order_release);
   
T2:   p = atomic_load_explicit(gp2, memory_order_consume);
  r1 = p->b;  /* Guaranteed to get 143. */
  q = atomic_load_explicit(gp1, memory_order_consume);
  if (p == q) {
  /* The compiler decides that q->c is same as p->c. */
  r2 = p->c; /* Could get 44 on weakly ordered system. */
  }
   
The loads from gp1 and gp2 are, as you say, unordered, so you get 
what
you get.
   
And publishing a structure via one RCU-protected pointer, updating 
it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and 
still
see consistency across all the fields in the structure, please put 
a lock
in the structure and use it to guard updates and accesses to those 
fields.
   
And here is a patch documenting the restrictions for the current 
Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(p, memory_order_consume).
   
Thoughts?
  
   That might serve as informal documentation for linux kernel
   programmers about the bounds on the optimisations that you expect
   compilers to do for common-case RCU code - and I guess that's what you
   intend it to be for.   But I don't see how one can make it precise
   enough to serve as a language definition, so that compiler people
   could confidently say yes, we respect that, which I guess is what
   you really need.  As a useful criterion, we should aim for something
   precise enough that in a verified-compiler context you can
   mathematically prove that the compiler will satisfy it  (even though
   that won't happen anytime soon for GCC), and that analysis tool
   authors can actually know what they're working with.   All this stuff
   about you should avoid cancellation, and avoid masking with just a
   small number of bits is just too vague.
  
   Understood, and yes, this is intended to document current compiler
   behavior for the Linux kernel community.  It would not make 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Peter Sewell
On 3 March 2014 20:44, Torvald Riegel  wrote:
> On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
>> On 1 March 2014 08:03, Paul E. McKenney  wrote:
>> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> Hi Paul,
>> >>
>> >> On 28 February 2014 18:50, Paul E. McKenney  
>> >> wrote:
>> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >  wrote:
>> >> >> > >
>> >> >> > > 3.  The comparison was against another RCU-protected pointer,
>> >> >> > > where that other pointer was properly fetched using one
>> >> >> > > of the RCU primitives.  Here it doesn't matter which 
>> >> >> > > pointer
>> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
>> >> >> > > that other pointer happened after the last update to the
>> >> >> > > pointed-to structure.
>> >> >> > >
>> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >
>> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> > that code like
>> >> >> >
>> >> >> >p = atomic_read(consume);
>> >> >> >X;
>> >> >> >q = atomic_read(consume);
>> >> >> >Y;
>> >> >> >if (p == q)
>> >> >> > data = p->val;
>> >> >> >
>> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> > can decide that the values are interchangeable.
>> >> >> >
>> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you 
>> >> >> say,
>> >> >> come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:   p = kmalloc(...);
>> >> >>   if (p == NULL)
>> >> >>   deal_with_it();
>> >> >>   p->a = 42;  /* Each field in its own cache line. */
>> >> >>   p->b = 43;
>> >> >>   p->c = 44;
>> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >>   p->b = 143;
>> >> >>   p->c = 144;
>> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >>   if (p == q) {
>> >> >>   /* The compiler decides that q->c is same as p->c. */
>> >> >>   r2 = p->c; /* Could get 44 on weakly ordered system. */
>> >> >>   }
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a lock
>> >> >> in the structure and use it to guard updates and accesses to those fields.
>> >> >
>> >> > And here is a patch documenting the restrictions for the current Linux
>> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >
>> >> > Thoughts?
>> >>
>> >> That might serve as informal documentation for linux kernel
>> >> programmers about the bounds on the optimisations that you expect
>> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> intend it to be for.   But I don't see how one can make it precise
>> >> enough to serve as a language definition, so that compiler people
>> >> could confidently say "yes, we respect that", which I guess is what
>> >> you really need.  As a useful criterion, we should aim for something
>> >> precise enough that in a verified-compiler context you can
>> >> mathematically prove that the compiler will satisfy it  (even though
>> >> that won't happen anytime soon for GCC), and that analysis tool
>> >> authors can actually know what they're working with.   All this stuff
>> >> about "you should avoid cancellation", and "avoid masking with just a
>> >> small number of bits" is just too vague.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Paul E. McKenney
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > + code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int force_zero_index = 1;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1);
> > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > + The reason this is buggy is that "&&" and "||" are often compiled
> > > > > + using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > + do order stores after such branches, they can speculate loads,
> > > > > + which can result in misordering bugs.
> > > > > +
> > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > + the following (quite strange) code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int flip_index = 0;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1);
> > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > + As before, the reason this is buggy is that relational operators
> > > > > + are often compiled using branches.  And as before, although
> > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > + after such branches, they can speculate loads, which can again
> > > > > + result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.  This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)
> 
> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
> 
>   p = atomic_load_explicit(&gp, memory_order_consume);
>   if (do_something_with(p->a)) {
>   /* fast path protected by RCU. */
>   return 0;
>   }
>   if (atomic_inc_not_zero(&p->refcnt)) {
>   /* slow path protected by reference counting. */
>   return do_something_else_with((struct foo *)p);  /* CHANGE */
>   }
>   /* Needed slow path, but raced with deletion. */
>   return -EAGAIN;
> 
> I am guessing that the cast ends the vdp.  Is that the case?

And here is a more elaborate example from the Linux kernel:

struct md_rdev value_dep_preserving *rdev;  /* CHANGE */

rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
|| rdev == NULL
|| test_bit(Unmerged, &rdev->flags)
|| test_bit(Faulty, &rdev->flags)
continue;

The fact that "rdev == NULL" returns vdp does not force the "||"
operators to be evaluated arithmetically because the entire function
is an "if" condition, correct?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Paul E. McKenney
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> 
> On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > 
> > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > +   dereferencing.  For example, the following (rather improbable)
> > > > +   code is buggy:
> > > > +
> > > > +   int a[2];
> > > > +   int index;
> > > > +   int force_zero_index = 1;
> > > > +
> > > > +   ...
> > > > +
> > > > +   r1 = rcu_dereference(i1);
> > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > +
> > > > +   The reason this is buggy is that "&&" and "||" are often compiled
> > > > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > > > +   do order stores after such branches, they can speculate loads,
> > > > +   which can result in misordering bugs.
> > > > +
> > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > +   the following (quite strange) code is buggy:
> > > > +
> > > > +   int a[2];
> > > > +   int index;
> > > > +   int flip_index = 0;
> > > > +
> > > > +   ...
> > > > +
> > > > +   r1 = rcu_dereference(i1);
> > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > +
> > > > +   As before, the reason this is buggy is that relational operators
> > > > +   are often compiled using branches.  And as before, although
> > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > +   after such branches, they can speculate loads, which can again
> > > > +   result in misordering bugs.
> > > 
> > > Those two would be allowed by the wording I have recently proposed,
> > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > there are further constraints due to the type of r1 and the values that
> > > flip_index can have).
> > 
> > And I am OK with the value_dep_preserving type providing more/better
> > guarantees than we get by default from current compilers.
> > 
> > One question, though.  Suppose that the code did not want a value
> > dependency to be tracked through a comparison operator.  What does
> > the developer do in that case?  (The reason I ask is that I have
> > not yet found a use case in the Linux kernel that expects a value
> > dependency to be tracked through a comparison.)
> 
> Hmm.  I suppose use an explicit cast to non-vdp before or after the
> comparison?

That should work well assuming that things like "if", "while", and "?:"
conditions are happy to take a vdp.  This assumes that p->a only returns
vdp if field "a" is declared vdp, otherwise we have vdps running wild
through the program.  ;-)

The other thing that can happen is that a vdp can get handed off to
another synchronization mechanism, for example, to reference counting:

p = atomic_load_explicit(&gp, memory_order_consume);
if (do_something_with(p->a)) {
/* fast path protected by RCU. */
return 0;
}
if (atomic_inc_not_zero(&p->refcnt)) {
/* slow path protected by reference counting. */
return do_something_else_with((struct foo *)p);  /* CHANGE */
}
/* Needed slow path, but raced with deletion. */
return -EAGAIN;

I am guessing that the cast ends the vdp.  Is that the case?

Thanx, Paul





Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Peter Sewell
On 3 March 2014 20:44, Torvald Riegel <trie...@redhat.com> wrote:
> On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> > On 1 March 2014 08:03, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> > > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> > > > Hi Paul,
> > > >
> > > > On 28 February 2014 18:50, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> > > > > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> > > > > > > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> > > > > > > <paul...@linux.vnet.ibm.com> wrote:
> > > > > > > >
> > > > > > > > 3.  The comparison was against another RCU-protected pointer,
> > > > > > > >     where that other pointer was properly fetched using one
> > > > > > > >     of the RCU primitives.  Here it doesn't matter which pointer
> > > > > > > >     you use.  At least as long as the rcu_assign_pointer() for
> > > > > > > >     that other pointer happened after the last update to the
> > > > > > > >     pointed-to structure.
> > > > > > > >
> > > > > > > > I am a bit nervous about #3.  Any thoughts on it?
> > > > > > >
> > > > > > > I think that it might be worth pointing out as an example, and saying
> > > > > > > that code like
> > > > > > >
> > > > > > >    p = atomic_read(consume);
> > > > > > >    X;
> > > > > > >    q = atomic_read(consume);
> > > > > > >    Y;
> > > > > > >    if (p == q)
> > > > > > >         data = p->val;
> > > > > > >
> > > > > > > then the access of "p->val" is constrained to be data-dependent on
> > > > > > > *either* p or q, but you can't really tell which, since the compiler
> > > > > > > can decide that the values are interchangeable.
> > > > > > >
> > > > > > > I cannot for the life of me come up with a situation where this would
> > > > > > > matter, though. If "X" contains a fence, then that fence will be a
> > > > > > > stronger ordering than anything the consume through "p" would
> > > > > > > guarantee anyway. And if "X" does *not* contain a fence, then the
> > > > > > > atomic reads of p and q are unordered *anyway*, so then whether the
> > > > > > > ordering to the access through "p" is through p or q is kind of
> > > > > > > irrelevant. No?
> > > > > >
> > > > > > I can make a contrived litmus test for it, but you are right, the only
> > > > > > time you can see it happen is when X has no barriers, in which case
> > > > > > you don't have any ordering anyway -- both the compiler and the CPU can
> > > > > > reorder the loads into p and q, and the read from p->val can, as you say,
> > > > > > come from either pointer.
> > > > > >
> > > > > > For whatever it is worth, here is the litmus test:
> > > > > >
> > > > > > T1:   p = kmalloc(...);
> > > > > >       if (p == NULL)
> > > > > >               deal_with_it();
> > > > > >       p->a = 42;  /* Each field in its own cache line. */
> > > > > >       p->b = 43;
> > > > > >       p->c = 44;
> > > > > >       atomic_store_explicit(&gp1, p, memory_order_release);
> > > > > >       p->b = 143;
> > > > > >       p->c = 144;
> > > > > >       atomic_store_explicit(&gp2, p, memory_order_release);
> > > > > >
> > > > > > T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> > > > > >       r1 = p->b;  /* Guaranteed to get 143. */
> > > > > >       q = atomic_load_explicit(&gp1, memory_order_consume);
> > > > > >       if (p == q) {
> > > > > >               /* The compiler decides that q->c is same as p->c. */
> > > > > >               r2 = p->c; /* Could get 44 on weakly ordered system. */
> > > > > >       }
> > > > > >
> > > > > > The loads from gp1 and gp2 are, as you say, unordered, so you get what
> > > > > > you get.
> > > > > >
> > > > > > And publishing a structure via one RCU-protected pointer, updating it,
> > > > > > then publishing it via another pointer seems to me to be asking for
> > > > > > trouble anyway.  If you really want to do something like that and still
> > > > > > see consistency across all the fields in the structure, please put a lock
> > > > > > in the structure and use it to guard updates and accesses to those fields.
> > > > >
> > > > > And here is a patch documenting the restrictions for the current Linux
> > > > > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> > > > > differently than atomic_load_explicit(&p, memory_order_consume).
> > > > >
> > > > > Thoughts?
> > > >
> > > > That might serve as informal documentation for linux kernel
> > > > programmers about the bounds on the optimisations that you expect
> > > > compilers to do for common-case RCU code - and I guess that's what you
> > > > intend it to be for.   But I don't see how one can make it precise
> > > > enough to serve as a language definition, so that compiler people
> > > > could confidently say "yes, we respect that", which I guess is what
> > > > you really need.  As a useful criterion, we should aim for something
> > > > precise enough that in a verified-compiler context you can
> > > > mathematically prove that the compiler will satisfy it  (even though
> > > > that won't happen anytime soon for GCC), and that analysis tool
> > > > authors can actually know what they're working with.   All this stuff
> > > > about "you should avoid cancellation", and "avoid masking with just a
> > > > small number of bits" is just too vague.
> > >
> > > Understood, and yes, this is intended to document current compiler
> > > behavior for the Linux kernel community.  It would not make sense to show
> > > it to the C11 or C++11 communities, except perhaps as an informational
> > > piece on current practice.
> > >
> > > The basic problem is that the compiler may be doing sophisticated
> > > reasoning with a bunch of non-local knowledge that it's deduced from
> > > the code, neither of which are 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > 
> > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > +o  Do not use the results from the boolean "&&" and "||" when
> > > + dereferencing.  For example, the following (rather improbable)
> > > + code is buggy:
> > > +
> > > + int a[2];
> > > + int index;
> > > + int force_zero_index = 1;
> > > +
> > > + ...
> > > +
> > > + r1 = rcu_dereference(i1);
> > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > +
> > > + The reason this is buggy is that "&&" and "||" are often compiled
> > > + using branches.  While weak-memory machines such as ARM or PowerPC
> > > + do order stores after such branches, they can speculate loads,
> > > + which can result in misordering bugs.
> > > +
> > > +o  Do not use the results from relational operators ("==", "!=",
> > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > + the following (quite strange) code is buggy:
> > > +
> > > + int a[2];
> > > + int index;
> > > + int flip_index = 0;
> > > +
> > > + ...
> > > +
> > > + r1 = rcu_dereference(i1);
> > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > +
> > > + As before, the reason this is buggy is that relational operators
> > > + are often compiled using branches.  And as before, although
> > > + weak-memory machines such as ARM or PowerPC do order stores
> > > + after such branches, they can speculate loads, which can again
> > > + result in misordering bugs.
> > 
> > Those two would be allowed by the wording I have recently proposed,
> > AFAICS.  r1 != flip_index would result in two possible values (unless
> > there are further constraints due to the type of r1 and the values that
> > flip_index can have).
> 
> And I am OK with the value_dep_preserving type providing more/better
> guarantees than we get by default from current compilers.
> 
> One question, though.  Suppose that the code did not want a value
> dependency to be tracked through a comparison operator.  What does
> the developer do in that case?  (The reason I ask is that I have
> not yet found a use case in the Linux kernel that expects a value
> dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >  wrote:
> >> >> > >
> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> > > where that other pointer was properly fetched using one
> >> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> >> > > that other pointer happened after the last update to the
> >> >> > > pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >p = atomic_read(consume);
> >> >> >X;
> >> >> >q = atomic_read(consume);
> >> >> >Y;
> >> >> >if (p == q)
> >> >> > data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>   if (p == NULL)
> >> >>   deal_with_it();
> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >>   p->b = 43;
> >> >>   p->c = 44;
> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>   p->b = 143;
> >> >>   p->c = 144;
> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>   if (p == q) {
> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >>   r2 = p->c; /* Could get 44 on weakly ordered system. */
> >> >>   }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to document current compiler
> > behavior for the Linux kernel community.  It would not make sense to show
> > it to the C11 or C++11 communities, except perhaps as an informational
> > piece on current practice.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Paul E. McKenney
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> 
> On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > +o  Do not use the results from the boolean "&&" and "||" when
> > +   dereferencing.  For example, the following (rather improbable)
> > +   code is buggy:
> > +
> > +   int a[2];
> > +   int index;
> > +   int force_zero_index = 1;
> > +
> > +   ...
> > +
> > +   r1 = rcu_dereference(i1);
> > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > +
> > +   The reason this is buggy is that "&&" and "||" are often compiled
> > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > +   do order stores after such branches, they can speculate loads,
> > +   which can result in misordering bugs.
> > +
> > +o  Do not use the results from relational operators ("==", "!=",
> > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > +   the following (quite strange) code is buggy:
> > +
> > +   int a[2];
> > +   int index;
> > +   int flip_index = 0;
> > +
> > +   ...
> > +
> > +   r1 = rcu_dereference(i1)
> > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > +
> > +   As before, the reason this is buggy is that relational operators
> > +   are often compiled using branches.  And as before, although
> > +   weak-memory machines such as ARM or PowerPC do order stores
> > +   after such branches, they can speculate loads, which can again
> > +   result in misordering bugs.
> 
> Those two would be allowed by the wording I have recently proposed,
> AFAICS.  r1 != flip_index would result in two possible values (unless
> there are further constraints due to the type of r1 and the values that
> flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)

> I don't think the wording is flawed.  We could raise the requirement of
> having more than one value left for r1 to having more than N with N > 1
> values left, but the fundamental problem remains in that a compiler
> could try to generate a (big) switch statement.
> 
> Instead, I think that this indicates that the value_dep_preserving type
> modifier would be useful: It would tell the compiler that it shouldn't
> transform this into a branch in this case, yet allow that optimization
> for all other code.

Understood!

BTW, my current task is generating examples using the value_dep_preserving
type for RCU-protected array indexes.
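
For a rough idea, one such example might look like the sketch below, in the
made-up syntax from earlier in this thread (the value_dep_preserving
qualifier exists in no compiler; it is a no-op macro here purely so the
sketch builds as plain C11):

```c
#include <stdatomic.h>

/* Hypothetical qualifier from this discussion; a no-op here. */
#define value_dep_preserving

static int a[2] = { 10, 20 };
static _Atomic int gidx;

int read_elem(void)
{
	/* The qualifier would oblige the compiler to preserve the value
	 * dependency from the consume load to the array access, instead
	 * of lowering any part of the index computation to a branch. */
	value_dep_preserving int r1 =
		atomic_load_explicit(&gidx, memory_order_consume);
	return a[r1];
}
```

An updater would publish a new index with
atomic_store_explicit(&gidx, i, memory_order_release).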

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> +oDo not use the results from the boolean "&&" and "||" when
> + dereferencing.  For example, the following (rather improbable)
> + code is buggy:
> +
> + int a[2];
> + int index;
> + int force_zero_index = 1;
> +
> + ...
> +
> + r1 = rcu_dereference(i1)
> + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> + The reason this is buggy is that "&&" and "||" are often compiled
> + using branches.  While weak-memory machines such as ARM or PowerPC
> + do order stores after such branches, they can speculate loads,
> + which can result in misordering bugs.
> +
> +oDo not use the results from relational operators ("==", "!=",
> + ">", ">=", "<", or "<=") when dereferencing.  For example,
> + the following (quite strange) code is buggy:
> +
> + int a[2];
> + int index;
> + int flip_index = 0;
> +
> + ...
> +
> + r1 = rcu_dereference(i1)
> + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> + As before, the reason this is buggy is that relational operators
> + are often compiled using branches.  And as before, although
> + weak-memory machines such as ARM or PowerPC do order stores
> + after such branches, they can speculate loads, which can again
> + result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).

I don't think the wording is flawed.  We could raise the requirement of
having more than one value left for r1 to having more than N with N > 1
values left, but the fundamental problem remains in that a compiler
could try to generate a (big) switch statement.
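
Concretely, the lowering being worried about is something like the
following sketch (whether a given compiler actually does it depends on its
optimizer, not on anything the source promises):

```c
int a[2];
int flip_index = 0;

/* Source form: the array index is value-dependent on r1. */
int load_dep(int r1)
{
	return a[r1 != flip_index];
}

/* What the compiler may emit instead: the comparison becomes a
 * branch, so the data dependency from r1 to the load is now only a
 * control dependency, which weak-memory CPUs do not order against
 * speculated loads. */
int load_branch(int r1)
{
	if (r1 == flip_index)
		return a[0];	/* index no longer depends on r1's value */
	else
		return a[1];
}
```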

Instead, I think that this indicates that the value_dep_preserving type
modifier would be useful: It would tell the compiler that it shouldn't
transform this into a branch in this case, yet allow that optimization
for all other code.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote:
> Your proposal looks quite promising at first glance.  But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results.
> 
> Fair enough?

Sure.  Thanks for doing the cross-check!



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>  wrote:
> >
> > 3.  The comparison was against another RCU-protected pointer,
> > where that other pointer was properly fetched using one
> > of the RCU primitives.  Here it doesn't matter which pointer
> > you use.  At least as long as the rcu_assign_pointer() for
> > that other pointer happened after the last update to the
> > pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>    p = atomic_read(consume);
>    X;
>    q = atomic_read(consume);
>    Y;
>    if (p == q)
>         data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.

The wording I proposed would make the p dereference have a value
dependency unless X and Y would somehow restrict p and q.  The reasoning
is that if the atomic loads return potentially more than one value, then
even if we find out that two such loads did return the same value, we
still don't know what the exact value was.
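
As a sketch of that substitution (hypothetical struct and names; the
single-threaded results are identical, which is exactly why the compiler
may do it):

```c
struct s { int val; };

/* Source form: the dereference is written through p. */
int reader(struct s *p, struct s *q)
{
	if (p == q)
		return p->val;
	return -1;
}

/* After "p == q" succeeds, the two registers hold the same bits, so
 * the compiler may substitute q for p; any dependency ordering then
 * chains back to q's load rather than p's. */
int reader_subst(struct s *p, struct s *q)
{
	if (p == q)
		return q->val;
	return -1;
}
```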



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == ) .. use p"), and you would forbid the
> "global value range optimization case" that Paul bright up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.
> 
> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).

The other example that comes to mind would be feedback-directed JIT
compilation.  I don't think that's widely used today, and it might never
be for the kernel -- but *in the standard*, we at least have to consider
what the future might bring.

> The good thing is that afaik, value-prediction is largely not used in
> real life, afaik. There are lots of papers on it, but I don't think
> anybody actually does it (although I can easily see some
> specint-specific optimization pattern that is built up around it).
> 
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume). So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers. But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.

I think one would need to show that the source is *not even indirectly*
a mo_consume load.  With the wording I proposed, value dependencies
don't break when storing to / loading from memory locations.

Thus, if a compiler ends up at a memory load after walking SSA, it needs
to prove that the load cannot read a value that (1) was produced by a
store sequenced-before the load and (2) might carry a value dependency
(e.g., by being a mo_consume load) that the value prediction in question
would break.  This, in general, requires alias analysis.
Deciding whether a prediction would break a value dependency has to
consider what later stages in a compiler would be doing, including LTO
or further rounds of inlining/optimizations.  OTOH, if the compiler can
treat an mo_consume load as returning all possible values (eg, by
ignoring all knowledge about it), then it can certainly do so with other
memory loads too.
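
For concreteness, compiler-initiated value prediction amounts to a rewrite
along these lines (the predicted constant would come from profiling;
everything here is made up for illustration):

```c
/* Source form: the result is data-dependent on the load. */
int original(int *p)
{
	return *p + 1;
}

/* Value-predicted form: the compiler guesses the likely value and
 * speculates on it.  On the guarded path the prediction is used as a
 * constant, so the data dependency on *p has been turned into a
 * control dependency -- the breakage discussed above. */
int predicted(int *p)
{
	int v = *p;
	if (v == 42)		/* prediction, e.g. from PGO */
		return 42 + 1;
	return v + 1;
}
```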

So, I think that the constraints due to value dependencies can matter in
practice.  However, the impact on optimizations of
non-mo_consume-related code is hard to estimate -- I don't see a huge
amount of impact right now, but I also wouldn't want to predict that
this can't change in the future.

> Of course, on more strongly ordered CPU's, even that "register
> argument" limitation goes away.
> 
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
> 
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> > operations on data of this type are preserving value dependencies:
> 
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering. So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
> 
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
> 
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help. The
> information is already there, without any burden on the programmer.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
 On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
  Hi Paul,
 
  On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
  wrote:
   On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
   On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 3.  The comparison was against another RCU-protected pointer,
 where that other pointer was properly fetched using one
 of the RCU primitives.  Here it doesn't matter which pointer
 you use.  At least as long as the rcu_assign_pointer() for
 that other pointer happened after the last update to the
 pointed-to structure.

 I am a bit nervous about #3.  Any thoughts on it?
   
I think that it might be worth pointing out as an example, and saying
that code like
   
   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
 data = p->val;
   
then the access of p->val is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.
   
I cannot for the life of me come up with a situation where this would
matter, though. If "X" contains a fence, then that fence will be a
stronger ordering than anything the consume through "p" would
guarantee anyway. And if "X" does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through "p" is through p or q is kind of
irrelevant. No?
  
 I can make a contrived litmus test for it, but you are right, the only
 time you can see it happen is when X has no barriers, in which case
 you don't have any ordering anyway -- both the compiler and the CPU can
 reorder the loads into p and q, and the read from p->val can, as you
 say, come from either pointer.

 For whatever it is worth, here is the litmus test:
  
 T1:   p = kmalloc(...);
       if (p == NULL)
               deal_with_it();
       p->a = 42;  /* Each field in its own cache line. */
       p->b = 43;
       p->c = 44;
       atomic_store_explicit(&gp1, p, memory_order_release);
       p->b = 143;
       p->c = 144;
       atomic_store_explicit(&gp2, p, memory_order_release);

 T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
       r1 = p->b;  /* Guaranteed to get 143. */
       q = atomic_load_explicit(&gp1, memory_order_consume);
       if (p == q) {
               /* The compiler decides that q->c is same as p->c. */
               r2 = p->c; /* Could get 44 on a weakly ordered system. */
       }
  
   The loads from gp1 and gp2 are, as you say, unordered, so you get what
   you get.
  
   And publishing a structure via one RCU-protected pointer, updating it,
   then publishing it via another pointer seems to me to be asking for
   trouble anyway.  If you really want to do something like that and still
 see consistency across all the fields in the structure, please put a
 lock in the structure and use it to guard updates and accesses to
 those fields.
  
   And here is a patch documenting the restrictions for the current Linux
   kernel.  The rules change a bit due to rcu_dereference() acting a bit
   differently than atomic_load_explicit(p, memory_order_consume).
  
   Thoughts?
 
  That might serve as informal documentation for linux kernel
  programmers about the bounds on the optimisations that you expect
  compilers to do for common-case RCU code - and I guess that's what you
  intend it to be for.   But I don't see how one can make it precise
  enough to serve as a language definition, so that compiler people
  could confidently say "yes, we respect that", which I guess is what
  you really need.  As a useful criterion, we should aim for something
  precise enough that in a verified-compiler context you can
  mathematically prove that the compiler will satisfy it  (even though
  that won't happen anytime soon for GCC), and that analysis tool
  authors can actually know what they're working with.   All this stuff
  about "you should avoid cancellation", and "avoid masking with just a
  small number of bits" is just too vague.
 
  Understood, and yes, this is intended to document current compiler
  behavior for the Linux kernel community.  It would not make sense to show
  it to the C11 or C++11 communities, except perhaps as an informational
  piece on current practice.
 
  The basic problem is that the compiler may be doing sophisticated
  reasoning with a bunch of non-local knowledge that it's deduced from
  the code, neither of which are well-understood, and here we have to
  identify some envelope, expressive 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
  
  On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
   +o  Do not use the results from the boolean "&&" and "||" when
   + dereferencing.  For example, the following (rather improbable)
   + code is buggy:
   +
   + int a[2];
   + int index;
   + int force_zero_index = 1;
   +
   + ...
   +
   + r1 = rcu_dereference(i1)
   + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
   +
   + The reason this is buggy is that "&&" and "||" are often compiled
   + using branches.  While weak-memory machines such as ARM or PowerPC
   + do order stores after such branches, they can speculate loads,
   + which can result in misordering bugs.
   +
   +o  Do not use the results from relational operators ("==", "!=",
   + ">", ">=", "<", or "<=") when dereferencing.  For example,
   + the following (quite strange) code is buggy:
   +
   + int a[2];
   + int index;
   + int flip_index = 0;
   +
   + ...
   +
   + r1 = rcu_dereference(i1)
   + r2 = a[r1 != flip_index];  /* BUGGY!!! */
   +
   + As before, the reason this is buggy is that relational operators
   + are often compiled using branches.  And as before, although
   + weak-memory machines such as ARM or PowerPC do order stores
  +   after such branches, they can speculate loads, which can again
   + result in misordering bugs.
  
  Those two would be allowed by the wording I have recently proposed,
  AFAICS.  r1 != flip_index would result in two possible values (unless
  there are further constraints due to the type of r1 and the values that
  flip_index can have).
 
 And I am OK with the value_dep_preserving type providing more/better
 guarantees than we get by default from current compilers.
 
 One question, though.  Suppose that the code did not want a value
 dependency to be tracked through a comparison operator.  What does
 the developer do in that case?  (The reason I ask is that I have
 not yet found a use case in the Linux kernel that expects a value
 dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?
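
In the made-up syntax from earlier in the thread, that might look like the
sketch below (value_dep_preserving is hypothetical and defined as a no-op
macro so this builds as plain C11; the cast to plain int is where
dependency tracking would deliberately stop):

```c
#include <stdatomic.h>

#define value_dep_preserving	/* hypothetical qualifier; no-op here */

static _Atomic int gflag;

int check(int expected)
{
	value_dep_preserving int r1 =
		atomic_load_explicit(&gflag, memory_order_consume);

	/* Casting away the (hypothetical) vdp qualifier would tell the
	 * compiler that no value dependency needs to survive the
	 * comparison, so it may compile it as an ordinary branch. */
	return (int)r1 == expected;
}
```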



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 11:44:52PM +, Peter Sewell wrote:
> On 2 March 2014 23:20, Paul E. McKenney  wrote:
> > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney 
> >> >>  wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> >  wrote:
> >> >> >> > >
> >> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> >> > > where that other pointer was properly fetched using one
> >> >> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> >> >> > > that other pointer happened after the last update to the
> >> >> >> > > pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >p = atomic_read(consume);
> >> >> >> >X;
> >> >> >> >q = atomic_read(consume);
> >> >> >> >Y;
> >> >> >> >if (p == q)
> >> >> >> > data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>   if (p == NULL)
> >> >> >>   deal_with_it();
> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >> >>   p->b = 43;
> >> >> >>   p->c = 44;
> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>   p->b = 143;
> >> >> >>   p->c = 144;
> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>   if (p == q) {
> >> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >> >>   r2 = p->c; /* Could get 44 on weakly ordered system. */
> >> >> >>   }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> >> you get.
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is what
> >> >> you really need.  As a useful criterion, we should aim for something
> >> >> precise enough that in a verified-compiler context you can
> >> >> mathematically prove that the compiler will satisfy it  (even though
> >> >> that won't happen anytime soon for GCC), and that analysis tool
> >> >> authors can actually know what they're working with.   All this stuff
> >> >> about "you should avoid cancellation", and "avoid masking with just a
> >> >> small number of bits" is just too vague.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Peter Sewell
On 2 March 2014 23:20, Paul E. McKenney  wrote:
> On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
>> On 1 March 2014 08:03, Paul E. McKenney  wrote:
>> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> Hi Paul,
>> >>
>> >> On 28 February 2014 18:50, Paul E. McKenney  
>> >> wrote:
>> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >  wrote:
>> >> >> > >
>> >> >> > > 3.  The comparison was against another RCU-protected pointer,
>> >> >> > > where that other pointer was properly fetched using one
>> >> >> > > of the RCU primitives.  Here it doesn't matter which 
>> >> >> > > pointer
>> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
>> >> >> > > that other pointer happened after the last update to the
>> >> >> > > pointed-to structure.
>> >> >> > >
>> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >
>> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> > that code like
>> >> >> >
>> >> >> >p = atomic_read(consume);
>> >> >> >X;
>> >> >> >q = atomic_read(consume);
>> >> >> >Y;
>> >> >> >if (p == q)
>> >> >> > data = p->val;
>> >> >> >
>> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> > can decide that the values are interchangeable.
>> >> >> >
>> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> >> come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:   p = kmalloc(...);
>> >> >>   if (p == NULL)
>> >> >>   deal_with_it();
>> >> >>   p->a = 42;  /* Each field in its own cache line. */
>> >> >>   p->b = 43;
>> >> >>   p->c = 44;
>> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >>   p->b = 143;
>> >> >>   p->c = 144;
>> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >>   if (p == q) {
>> >> >>   /* The compiler decides that q->c is same as p->c. */
>> >> >>   r2 = p->c; /* Could get 44 on weakly ordered system. */
>> >> >>   }
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a lock
>> >> >> in the structure and use it to guard updates and accesses to those fields.
>> >> >
>> >> > And here is a patch documenting the restrictions for the current Linux
>> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >
>> >> > Thoughts?
>> >>
>> >> That might serve as informal documentation for linux kernel
>> >> programmers about the bounds on the optimisations that you expect
>> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> intend it to be for.   But I don't see how one can make it precise
>> >> enough to serve as a language definition, so that compiler people
>> >> could confidently say "yes, we respect that", which I guess is what
>> >> you really need.  As a useful criterion, we should aim for something
>> >> precise enough that in a verified-compiler context you can
>> >> mathematically prove that the compiler will satisfy it  (even though
>> >> that won't happen anytime soon for GCC), and that analysis tool
>> authors can actually know what they're working with.   All this stuff
>> about "you should avoid cancellation", and "avoid masking with just a
>> small number of bits" is just too vague.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney  
> >> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >  wrote:
> >> >> > >
> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> > > where that other pointer was properly fetched using one
> >> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> >> > > that other pointer happened after the last update to the
> >> >> > > pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >p = atomic_read(consume);
> >> >> >X;
> >> >> >q = atomic_read(consume);
> >> >> >Y;
> >> >> >if (p == q)
> >> >> > data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>   if (p == NULL)
> >> >>   deal_with_it();
> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >>   p->b = 43;
> >> >>   p->c = 44;
> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>   p->b = 143;
> >> >>   p->c = 144;
> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>   if (p == q) {
> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >>   r2 = p->c; /* Could get 44 on weakly ordered system. */
> >> >>   }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to document current compiler
> > behavior for the Linux kernel community.  It would not make sense to show
> > it to the C11 or C++11 communities, except perhaps as an informational
> > piece on current practice.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Peter Sewell
On 1 March 2014 08:03, Paul E. McKenney  wrote:
> On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> Hi Paul,
>>
>> On 28 February 2014 18:50, Paul E. McKenney  
>> wrote:
>> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >  wrote:
>> >> > >
>> >> > > 3.  The comparison was against another RCU-protected pointer,
>> >> > > where that other pointer was properly fetched using one
>> >> > > of the RCU primitives.  Here it doesn't matter which pointer
>> >> > > you use.  At least as long as the rcu_assign_pointer() for
>> >> > > that other pointer happened after the last update to the
>> >> > > pointed-to structure.
>> >> > >
>> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >
>> >> > I think that it might be worth pointing out as an example, and saying
>> >> > that code like
>> >> >
>> >> >p = atomic_read(consume);
>> >> >X;
>> >> >q = atomic_read(consume);
>> >> >Y;
>> >> >if (p == q)
>> >> > data = p->val;
>> >> >
>> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> > can decide that the values are interchangeable.
>> >> >
>> >> > I cannot for the life of me come up with a situation where this would
>> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> > stronger ordering than anything the consume through "p" would
>> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> > ordering to the access through "p" is through p or q is kind of
>> >> > irrelevant. No?
>> >>
>> >> I can make a contrived litmus test for it, but you are right, the only
>> >> time you can see it happen is when X has no barriers, in which case
>> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> come from either pointer.
>> >>
>> >> For whatever it is worth, here is the litmus test:
>> >>
>> >> T1:   p = kmalloc(...);
>> >>   if (p == NULL)
>> >>   deal_with_it();
>> >>   p->a = 42;  /* Each field in its own cache line. */
>> >>   p->b = 43;
>> >>   p->c = 44;
>> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >>   p->b = 143;
>> >>   p->c = 144;
>> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >>
>> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >>   if (p == q) {
>> >>   /* The compiler decides that q->c is same as p->c. */
>> >>   r2 = p->c; /* Could get 44 on weakly ordered system. */
>> >>   }
>> >>
>> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> you get.
>> >>
>> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> then publishing it via another pointer seems to me to be asking for
>> >> trouble anyway.  If you really want to do something like that and still
>> >> see consistency across all the fields in the structure, please put a lock
>> >> in the structure and use it to guard updates and accesses to those fields.
>> >
>> > And here is a patch documenting the restrictions for the current Linux
>> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >
>> > Thoughts?
>>
>> That might serve as informal documentation for linux kernel
>> programmers about the bounds on the optimisations that you expect
>> compilers to do for common-case RCU code - and I guess that's what you
>> intend it to be for.   But I don't see how one can make it precise
>> enough to serve as a language definition, so that compiler people
>> could confidently say "yes, we respect that", which I guess is what
>> you really need.  As a useful criterion, we should aim for something
>> precise enough that in a verified-compiler context you can
>> mathematically prove that the compiler will satisfy it  (even though
>> that won't happen anytime soon for GCC), and that analysis tool
>> authors can actually know what they're working with.   All this stuff
>> about "you should avoid cancellation", and "avoid masking with just a
>> small number of bits" is just too vague.
>
> Understood, and yes, this is intended to document current compiler
> behavior for the Linux kernel community.  It would not make sense to show
> it to the C11 or C++11 communities, except perhaps as an informational
> piece on current practice.
>
>> The basic problem is that the compiler may be doing sophisticated
>> reasoning with a bunch of non-local knowledge that it's deduced from
>> the code, neither of which are well-understood, and here we have to
>> identify some envelope, expressive enough for RCU idioms, in which
>> that reasoning doesn't allow data/address dependencies to be removed
>> (and hence the hardware guarantee about them will be maintained at the

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Peter Sewell
On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
 Hi Paul,

 On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
 wrote:
  On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
  On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
   On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
   
3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.
   
I am a bit nervous about #3.  Any thoughts on it?
  
   I think that it might be worth pointing out as an example, and saying
   that code like
  
  p = atomic_read(consume);
  X;
  q = atomic_read(consume);
  Y;
  if (p == q)
   data = p-val;
  
   then the access of p-val is constrained to be data-dependent on
   *either* p or q, but you can't really tell which, since the compiler
   can decide that the values are interchangeable.
  
   I cannot for the life of me come up with a situation where this would
   matter, though. If X contains a fence, then that fence will be a
   stronger ordering than anything the consume through p would
   guarantee anyway. And if X does *not* contain a fence, then the
   atomic reads of p and q are unordered *anyway*, so then whether the
   ordering to the access through p is through p or q is kind of
   irrelevant. No?
 
  I can make a contrived litmus test for it, but you are right, the only
  time you can see it happen is when X has no barriers, in which case
  you don't have any ordering anyway -- both the compiler and the CPU can
  reorder the loads into p and q, and the read from p-val can, as you say,
  come from either pointer.
 
  For whatever it is worth, hear is the litmus test:
 
  T1:   p = kmalloc(...);
if (p == NULL)
deal_with_it();
p-a = 42;  /* Each field in its own cache line. */
p-b = 43;
p-c = 44;
atomic_store_explicit(gp1, p, memory_order_release);
p-b = 143;
p-c = 144;
atomic_store_explicit(gp2, p, memory_order_release);
 
  T2:   p = atomic_load_explicit(gp2, memory_order_consume);
r1 = p-b;  /* Guaranteed to get 143. */
q = atomic_load_explicit(gp1, memory_order_consume);
if (p == q) {
/* The compiler decides that q-c is same as p-c. */
r2 = p-c; /* Could get 44 on weakly order system. */
}
 
  The loads from gp1 and gp2 are, as you say, unordered, so you get what
  you get.
 
  And publishing a structure via one RCU-protected pointer, updating it,
  then publishing it via another pointer seems to me to be asking for
  trouble anyway.  If you really want to do something like that and still
  see consistency across all the fields in the structure, please put a lock
  in the structure and use it to guard updates and accesses to those fields.
 
  And here is a patch documenting the restrictions for the current Linux
  kernel.  The rules change a bit due to rcu_dereference() acting a bit
  differently than atomic_load_explicit(p, memory_order_consume).
 
  Thoughts?

 That might serve as informal documentation for linux kernel
 programmers about the bounds on the optimisations that you expect
 compilers to do for common-case RCU code - and I guess that's what you
 intend it to be for.   But I don't see how one can make it precise
 enough to serve as a language definition, so that compiler people
 could confidently say yes, we respect that, which I guess is what
 you really need.  As a useful criterion, we should aim for something
 precise enough that in a verified-compiler context you can
 mathematically prove that the compiler will satisfy it  (even though
 that won't happen anytime soon for GCC), and that analysis tool
 authors can actually know what they're working with.   All this stuff
 about you should avoid cancellation, and avoid masking with just a
 small number of bits is just too vague.

 Understood, and yes, this is intended to document current compiler
 behavior for the Linux kernel community.  It would not make sense to show
 it to the C11 or C++11 communities, except perhaps as an informational
 piece on current practice.

 The basic problem is that the compiler may be doing sophisticated
 reasoning with a bunch of non-local knowledge that it's deduced from
 the code, neither of which are well-understood, and here we have to
 identify some envelope, expressive enough for RCU idioms, in which
 that reasoning doesn't allow data/address dependencies to be removed
 (and hence the hardware guarantee about them will be maintained at the
 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
 On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
  Hi Paul,
 
  On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
  wrote:
   On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
   On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 3.  The comparison was against another RCU-protected pointer,
 where that other pointer was properly fetched using one
 of the RCU primitives.  Here it doesn't matter which pointer
 you use.  At least as long as the rcu_assign_pointer() for
 that other pointer happened after the last update to the
 pointed-to structure.

 I am a bit nervous about #3.  Any thoughts on it?
   
I think that it might be worth pointing out as an example, and saying
that code like
   
   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
data = p-val;
   
then the access of p-val is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.
   
I cannot for the life of me come up with a situation where this would
matter, though. If X contains a fence, then that fence will be a
stronger ordering than anything the consume through p would
guarantee anyway. And if X does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through p is through p or q is kind of
irrelevant. No?
  
   I can make a contrived litmus test for it, but you are right, the only
   time you can see it happen is when X has no barriers, in which case
   you don't have any ordering anyway -- both the compiler and the CPU can
   reorder the loads into p and q, and the read from p-val can, as you 
   say,
   come from either pointer.
  
   For whatever it is worth, hear is the litmus test:
  
   T1:   p = kmalloc(...);
 if (p == NULL)
 deal_with_it();
 p-a = 42;  /* Each field in its own cache line. */
 p-b = 43;
 p-c = 44;
 atomic_store_explicit(gp1, p, memory_order_release);
 p-b = 143;
 p-c = 144;
 atomic_store_explicit(gp2, p, memory_order_release);
  
   T2:   p = atomic_load_explicit(gp2, memory_order_consume);
 r1 = p-b;  /* Guaranteed to get 143. */
 q = atomic_load_explicit(gp1, memory_order_consume);
 if (p == q) {
 /* The compiler decides that q-c is same as p-c. */
 r2 = p-c; /* Could get 44 on weakly order system. */
 }
  
   The loads from gp1 and gp2 are, as you say, unordered, so you get what
   you get.
  
   And publishing a structure via one RCU-protected pointer, updating it,
   then publishing it via another pointer seems to me to be asking for
   trouble anyway.  If you really want to do something like that and still
   see consistency across all the fields in the structure, please put a 
   lock
   in the structure and use it to guard updates and accesses to those 
   fields.
  
   And here is a patch documenting the restrictions for the current Linux
   kernel.  The rules change a bit due to rcu_dereference() acting a bit
   differently than atomic_load_explicit(p, memory_order_consume).
  
   Thoughts?
 
  That might serve as informal documentation for linux kernel
  programmers about the bounds on the optimisations that you expect
  compilers to do for common-case RCU code - and I guess that's what you
  intend it to be for.   But I don't see how one can make it precise
  enough to serve as a language definition, so that compiler people
  could confidently say yes, we respect that, which I guess is what
  you really need.  As a useful criterion, we should aim for something
  precise enough that in a verified-compiler context you can
  mathematically prove that the compiler will satisfy it  (even though
  that won't happen anytime soon for GCC), and that analysis tool
  authors can actually know what they're working with.   All this stuff   
  about you should avoid cancellation, and avoid masking with just a
  small number of bits is just too vague.
 
  Understood, and yes, this is intended to document current compiler
  behavior for the Linux kernel community.  It would not make sense to show
  it to the C11 or C++11 communities, except perhaps as an informational
  piece on current practice.
 
  The basic problem is that the compiler may be doing sophisticated
  reasoning with a bunch of non-local knowledge that it's deduced from
  the code, neither of which are well-understood, and here we have to
  identify some envelope, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Peter Sewell
On 2 March 2014 23:20, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
 On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
  Hi Paul,
 
  On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
  wrote:
   On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
   On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 3.  The comparison was against another RCU-protected pointer,
 where that other pointer was properly fetched using one
 of the RCU primitives.  Here it doesn't matter which 
 pointer
 you use.  At least as long as the rcu_assign_pointer() for
 that other pointer happened after the last update to the
 pointed-to structure.

 I am a bit nervous about #3.  Any thoughts on it?
   
I think that it might be worth pointing out as an example, and saying
that code like
   
   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
data = p-val;
   
then the access of p-val is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.
   
I cannot for the life of me come up with a situation where this would
matter, though. If X contains a fence, then that fence will be a
stronger ordering than anything the consume through p would
guarantee anyway. And if X does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through p is through p or q is kind of
irrelevant. No?
  
   I can make a contrived litmus test for it, but you are right, the only
   time you can see it happen is when X has no barriers, in which case
   you don't have any ordering anyway -- both the compiler and the CPU can
   reorder the loads into p and q, and the read from p-val can, as you 
   say,
   come from either pointer.
  
   For whatever it is worth, hear is the litmus test:
  
   T1:   p = kmalloc(...);
 if (p == NULL)
 deal_with_it();
 p-a = 42;  /* Each field in its own cache line. */
 p-b = 43;
 p-c = 44;
 atomic_store_explicit(gp1, p, memory_order_release);
 p-b = 143;
 p-c = 144;
 atomic_store_explicit(gp2, p, memory_order_release);
  
   T2:   p = atomic_load_explicit(gp2, memory_order_consume);
 r1 = p-b;  /* Guaranteed to get 143. */
 q = atomic_load_explicit(gp1, memory_order_consume);
 if (p == q) {
 /* The compiler decides that q-c is same as p-c. */
 r2 = p-c; /* Could get 44 on weakly order system. */
 }
  
   The loads from gp1 and gp2 are, as you say, unordered, so you get what
   you get.
  
   And publishing a structure via one RCU-protected pointer, updating it,
   then publishing it via another pointer seems to me to be asking for
   trouble anyway.  If you really want to do something like that and still
   see consistency across all the fields in the structure, please put a
   lock in the structure and use it to guard updates and accesses to
   those fields.
  
   And here is a patch documenting the restrictions for the current Linux
   kernel.  The rules change a bit due to rcu_dereference() acting a bit
   differently than atomic_load_explicit(&p, memory_order_consume).
  
   Thoughts?
 
  That might serve as informal documentation for linux kernel
  programmers about the bounds on the optimisations that you expect
  compilers to do for common-case RCU code - and I guess that's what you
  intend it to be for.   But I don't see how one can make it precise
  enough to serve as a language definition, so that compiler people
  could confidently say "yes, we respect that", which I guess is what
  you really need.  As a useful criterion, we should aim for something
  precise enough that in a verified-compiler context you can
  mathematically prove that the compiler will satisfy it  (even though
  that won't happen anytime soon for GCC), and that analysis tool
  authors can actually know what they're working with.   All this stuff
  about "you should avoid cancellation", and "avoid masking with just a
  small number of bits" is just too vague.
 
  Understood, and yes, this is intended to document current compiler
  behavior for the Linux kernel community.  It would not make sense to show
  it to the C11 or C++11 communities, except perhaps as an informational
  piece on current practice.
 
  The basic problem is that the compiler may be doing sophisticated
  reasoning with a bunch of non-local knowledge that it's deduced from
  the code, neither 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 11:44:52PM +, Peter Sewell wrote:
 On 2 March 2014 23:20, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
  On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
   On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
   Hi Paul,
  
   On 28 February 2014 18:50, Paul E. McKenney 
   paul...@linux.vnet.ibm.com wrote:
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  3.  The comparison was against another RCU-protected pointer,
  where that other pointer was properly fetched using one
  of the RCU primitives.  Here it doesn't matter which pointer
  you use.  At least as long as the rcu_assign_pointer() for
  that other pointer happened after the last update to the
  pointed-to structure.
 
  I am a bit nervous about #3.  Any thoughts on it?

 I think that it might be worth pointing out as an example, and saying
 that code like

p = atomic_read(consume);
X;
q = atomic_read(consume);
Y;
if (p == q)
 data = p->val;

 then the access of p->val is constrained to be data-dependent on
 *either* p or q, but you can't really tell which, since the compiler
 can decide that the values are interchangeable.

 I cannot for the life of me come up with a situation where this would
 matter, though. If X contains a fence, then that fence will be a
 stronger ordering than anything the consume through p would
 guarantee anyway. And if X does *not* contain a fence, then the
 atomic reads of p and q are unordered *anyway*, so then whether the
 ordering to the access through p is through p or q is kind of
 irrelevant. No?
   
I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you
say, come from either pointer.
   
For whatever it is worth, here is the litmus test:
   
T1:   p = kmalloc(...);
      if (p == NULL)
              deal_with_it();
      p->a = 42;  /* Each field in its own cache line. */
      p->b = 43;
      p->c = 44;
      atomic_store_explicit(&gp1, p, memory_order_release);
      p->b = 143;
      p->c = 144;
      atomic_store_explicit(&gp2, p, memory_order_release);

T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
      r1 = p->b;  /* Guaranteed to get 143. */
      q = atomic_load_explicit(&gp1, memory_order_consume);
      if (p == q) {
              /* The compiler decides that q->c is the same as p->c. */
              r2 = p->c; /* Could get 44 on a weakly ordered system. */
      }
   
The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.
   
And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a
lock in the structure and use it to guard updates and accesses to those
fields.
   
And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(&p, memory_order_consume).
   
Thoughts?
  
   That might serve as informal documentation for linux kernel
   programmers about the bounds on the optimisations that you expect
   compilers to do for common-case RCU code - and I guess that's what you
   intend it to be for.   But I don't see how one can make it precise
   enough to serve as a language definition, so that compiler people
   could confidently say "yes, we respect that", which I guess is what
   you really need.  As a useful criterion, we should aim for something
   precise enough that in a verified-compiler context you can
   mathematically prove that the compiler will satisfy it  (even though
   that won't happen anytime soon for GCC), and that analysis tool
   authors can actually know what they're working with.   All this stuff
   about "you should avoid cancellation", and "avoid masking with just a
   small number of bits" is just too vague.
  
   Understood, and yes, this is intended to document current compiler
   behavior for the Linux kernel community.  It would not make sense to show
   it to the C11 or C++11 communities, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Paul E. McKenney
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> Hi Paul,
> 
> On 28 February 2014 18:50, Paul E. McKenney  
> wrote:
> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >  wrote:
> >> > >
> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> > > where that other pointer was properly fetched using one
> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> > > that other pointer happened after the last update to the
> >> > > pointed-to structure.
> >> > >
> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >
> >> > I think that it might be worth pointing out as an example, and saying
> >> > that code like
> >> >
> >> >p = atomic_read(consume);
> >> >X;
> >> >q = atomic_read(consume);
> >> >Y;
> >> >if (p == q)
> >> > data = p->val;
> >> >
> >> > then the access of "p->val" is constrained to be data-dependent on
> >> > *either* p or q, but you can't really tell which, since the compiler
> >> > can decide that the values are interchangeable.
> >> >
> >> > I cannot for the life of me come up with a situation where this would
> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> > stronger ordering than anything the consume through "p" would
> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> > ordering to the access through "p" is through p or q is kind of
> >> > irrelevant. No?
> >>
> >> I can make a contrived litmus test for it, but you are right, the only
> >> time you can see it happen is when X has no barriers, in which case
> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> come from either pointer.
> >>
> >> For whatever it is worth, here is the litmus test:
> >>
> >> T1:   p = kmalloc(...);
> >>   if (p == NULL)
> >>   deal_with_it();
> >>   p->a = 42;  /* Each field in its own cache line. */
> >>   p->b = 43;
> >>   p->c = 44;
> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >>   p->b = 143;
> >>   p->c = 144;
> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >>
> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >>   if (p == q) {
> >>   /* The compiler decides that q->c is same as p->c. */
> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >>   }
> >>
> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> you get.
> >>
> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> then publishing it via another pointer seems to me to be asking for
> >> trouble anyway.  If you really want to do something like that and still
> >> see consistency across all the fields in the structure, please put a lock
> >> in the structure and use it to guard updates and accesses to those fields.
> >
> > And here is a patch documenting the restrictions for the current Linux
> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> > differently than atomic_load_explicit(&p, memory_order_consume).
> >
> > Thoughts?
> 
> That might serve as informal documentation for linux kernel
> programmers about the bounds on the optimisations that you expect
> compilers to do for common-case RCU code - and I guess that's what you
> intend it to be for.   But I don't see how one can make it precise
> enough to serve as a language definition, so that compiler people
> could confidently say "yes, we respect that", which I guess is what
> you really need.  As a useful criterion, we should aim for something
> precise enough that in a verified-compiler context you can
> mathematically prove that the compiler will satisfy it  (even though
> that won't happen anytime soon for GCC), and that analysis tool
> authors can actually know what they're working with.   All this stuff
> about "you should avoid cancellation", and "avoid masking with just a
> small number of bits" is just too vague.

Understood, and yes, this is intended to document current compiler
behavior for the Linux kernel community.  It would not make sense to show
it to the C11 or C++11 communities, except perhaps as an informational
piece on current practice.

> The basic problem is that the compiler may be doing sophisticated
> reasoning with a bunch of non-local knowledge that it's deduced from
> the code, neither of which are well-understood, and here we have to
> identify some envelope, expressive 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Peter Sewell
Hi Paul,

On 28 February 2014 18:50, Paul E. McKenney  wrote:
> On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >  wrote:
>> > >
>> > > 3.  The comparison was against another RCU-protected pointer,
>> > > where that other pointer was properly fetched using one
>> > > of the RCU primitives.  Here it doesn't matter which pointer
>> > > you use.  At least as long as the rcu_assign_pointer() for
>> > > that other pointer happened after the last update to the
>> > > pointed-to structure.
>> > >
>> > > I am a bit nervous about #3.  Any thoughts on it?
>> >
>> > I think that it might be worth pointing out as an example, and saying
>> > that code like
>> >
>> >p = atomic_read(consume);
>> >X;
>> >q = atomic_read(consume);
>> >Y;
>> >if (p == q)
>> > data = p->val;
>> >
>> > then the access of "p->val" is constrained to be data-dependent on
>> > *either* p or q, but you can't really tell which, since the compiler
>> > can decide that the values are interchangeable.
>> >
>> > I cannot for the life of me come up with a situation where this would
>> > matter, though. If "X" contains a fence, then that fence will be a
>> > stronger ordering than anything the consume through "p" would
>> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> > atomic reads of p and q are unordered *anyway*, so then whether the
>> > ordering to the access through "p" is through p or q is kind of
>> > irrelevant. No?
>>
>> I can make a contrived litmus test for it, but you are right, the only
>> time you can see it happen is when X has no barriers, in which case
>> you don't have any ordering anyway -- both the compiler and the CPU can
>> reorder the loads into p and q, and the read from p->val can, as you say,
>> come from either pointer.
>>
>> For whatever it is worth, here is the litmus test:
>>
>> T1:   p = kmalloc(...);
>>   if (p == NULL)
>>   deal_with_it();
>>   p->a = 42;  /* Each field in its own cache line. */
>>   p->b = 43;
>>   p->c = 44;
>>   atomic_store_explicit(&gp1, p, memory_order_release);
>>   p->b = 143;
>>   p->c = 144;
>>   atomic_store_explicit(&gp2, p, memory_order_release);
>>
>> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>>   r1 = p->b;  /* Guaranteed to get 143. */
>>   q = atomic_load_explicit(&gp1, memory_order_consume);
>>   if (p == q) {
>>   /* The compiler decides that q->c is same as p->c. */
>>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
>>   }
>>
>> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> you get.
>>
>> And publishing a structure via one RCU-protected pointer, updating it,
>> then publishing it via another pointer seems to me to be asking for
>> trouble anyway.  If you really want to do something like that and still
>> see consistency across all the fields in the structure, please put a lock
>> in the structure and use it to guard updates and accesses to those fields.
>
> And here is a patch documenting the restrictions for the current Linux
> kernel.  The rules change a bit due to rcu_dereference() acting a bit
> differently than atomic_load_explicit(&p, memory_order_consume).
>
> Thoughts?

That might serve as informal documentation for linux kernel
programmers about the bounds on the optimisations that you expect
compilers to do for common-case RCU code - and I guess that's what you
intend it to be for.   But I don't see how one can make it precise
enough to serve as a language definition, so that compiler people
could confidently say "yes, we respect that", which I guess is what
you really need.  As a useful criterion, we should aim for something
precise enough that in a verified-compiler context you can
mathematically prove that the compiler will satisfy it  (even though
that won't happen anytime soon for GCC), and that analysis tool
authors can actually know what they're working with.   All this stuff
about "you should avoid cancellation", and "avoid masking with just a
small number of bits" is just too vague.

The basic problem is that the compiler may be doing sophisticated
reasoning with a bunch of non-local knowledge that it's deduced from
the code, neither of which are well-understood, and here we have to
identify some envelope, expressive enough for RCU idioms, in which
that reasoning doesn't allow data/address dependencies to be removed
(and hence the hardware guarantee about them will be maintained at the
source level).

The C11 syntactic notion of dependency, whatever its faults, was at
least precise, could be reasoned about locally (just looking at the
syntactic code in question), and did do that.  The fact that current
compilers do optimisations that remove dependencies and will likely
have many bugs at present is besides the 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Peter Sewell
Hi Paul,

On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
 On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
  On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:
  
   3.  The comparison was against another RCU-protected pointer,
   where that other pointer was properly fetched using one
   of the RCU primitives.  Here it doesn't matter which pointer
   you use.  At least as long as the rcu_assign_pointer() for
   that other pointer happened after the last update to the
   pointed-to structure.
  
   I am a bit nervous about #3.  Any thoughts on it?
 
  I think that it might be worth pointing out as an example, and saying
  that code like
 
 p = atomic_read(consume);
 X;
 q = atomic_read(consume);
 Y;
 if (p == q)
  data = p->val;
 
  then the access of p->val is constrained to be data-dependent on
  *either* p or q, but you can't really tell which, since the compiler
  can decide that the values are interchangeable.
 
  I cannot for the life of me come up with a situation where this would
  matter, though. If X contains a fence, then that fence will be a
  stronger ordering than anything the consume through p would
  guarantee anyway. And if X does *not* contain a fence, then the
  atomic reads of p and q are unordered *anyway*, so then whether the
  ordering to the access through p is through p or q is kind of
  irrelevant. No?

 I can make a contrived litmus test for it, but you are right, the only
 time you can see it happen is when X has no barriers, in which case
 you don't have any ordering anyway -- both the compiler and the CPU can
 reorder the loads into p and q, and the read from p->val can, as you say,
 come from either pointer.

 For whatever it is worth, here is the litmus test:

 T1:   p = kmalloc(...);
       if (p == NULL)
               deal_with_it();
       p->a = 42;  /* Each field in its own cache line. */
       p->b = 43;
       p->c = 44;
       atomic_store_explicit(&gp1, p, memory_order_release);
       p->b = 143;
       p->c = 144;
       atomic_store_explicit(&gp2, p, memory_order_release);

 T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
       r1 = p->b;  /* Guaranteed to get 143. */
       q = atomic_load_explicit(&gp1, memory_order_consume);
       if (p == q) {
               /* The compiler decides that q->c is the same as p->c. */
               r2 = p->c; /* Could get 44 on a weakly ordered system. */
       }

 The loads from gp1 and gp2 are, as you say, unordered, so you get what
 you get.

 And publishing a structure via one RCU-protected pointer, updating it,
 then publishing it via another pointer seems to me to be asking for
 trouble anyway.  If you really want to do something like that and still
 see consistency across all the fields in the structure, please put a lock
 in the structure and use it to guard updates and accesses to those fields.

 And here is a patch documenting the restrictions for the current Linux
 kernel.  The rules change a bit due to rcu_dereference() acting a bit
 differently than atomic_load_explicit(&p, memory_order_consume).

 Thoughts?

That might serve as informal documentation for linux kernel
programmers about the bounds on the optimisations that you expect
compilers to do for common-case RCU code - and I guess that's what you
intend it to be for.   But I don't see how one can make it precise
enough to serve as a language definition, so that compiler people
could confidently say "yes, we respect that", which I guess is what
you really need.  As a useful criterion, we should aim for something
precise enough that in a verified-compiler context you can
mathematically prove that the compiler will satisfy it  (even though
that won't happen anytime soon for GCC), and that analysis tool
authors can actually know what they're working with.   All this stuff
about "you should avoid cancellation", and "avoid masking with just a
small number of bits" is just too vague.

The basic problem is that the compiler may be doing sophisticated
reasoning with a bunch of non-local knowledge that it's deduced from
the code, neither of which are well-understood, and here we have to
identify some envelope, expressive enough for RCU idioms, in which
that reasoning doesn't allow data/address dependencies to be removed
(and hence the hardware guarantee about them will be maintained at the
source level).

The C11 syntactic notion of dependency, whatever its faults, was at
least precise, could be reasoned about locally (just looking at the
syntactic code in question), and did do that.  The fact that current
compilers do optimisations that remove dependencies and will likely
have many bugs at present is beside the point - this was surely
intended as a *new* constraint on what they are allowed to do.  The
interesting question is really whether the compiler writers 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Paul E. McKenney
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
 Hi Paul,
 
 On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
 wrote:
  On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
  On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
   On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
   
3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.
   
I am a bit nervous about #3.  Any thoughts on it?
  
   I think that it might be worth pointing out as an example, and saying
   that code like
  
  p = atomic_read(consume);
  X;
  q = atomic_read(consume);
  Y;
  if (p == q)
   data = p->val;
  
   then the access of p->val is constrained to be data-dependent on
   *either* p or q, but you can't really tell which, since the compiler
   can decide that the values are interchangeable.
  
   I cannot for the life of me come up with a situation where this would
   matter, though. If X contains a fence, then that fence will be a
   stronger ordering than anything the consume through p would
   guarantee anyway. And if X does *not* contain a fence, then the
   atomic reads of p and q are unordered *anyway*, so then whether the
   ordering to the access through p is through p or q is kind of
   irrelevant. No?
 
  I can make a contrived litmus test for it, but you are right, the only
  time you can see it happen is when X has no barriers, in which case
  you don't have any ordering anyway -- both the compiler and the CPU can
  reorder the loads into p and q, and the read from p->val can, as you say,
  come from either pointer.
 
  For whatever it is worth, here is the litmus test:
 
  T1:   p = kmalloc(...);
        if (p == NULL)
                deal_with_it();
        p->a = 42;  /* Each field in its own cache line. */
        p->b = 43;
        p->c = 44;
        atomic_store_explicit(&gp1, p, memory_order_release);
        p->b = 143;
        p->c = 144;
        atomic_store_explicit(&gp2, p, memory_order_release);

  T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
        r1 = p->b;  /* Guaranteed to get 143. */
        q = atomic_load_explicit(&gp1, memory_order_consume);
        if (p == q) {
                /* The compiler decides that q->c is the same as p->c. */
                r2 = p->c; /* Could get 44 on a weakly ordered system. */
        }
 
  The loads from gp1 and gp2 are, as you say, unordered, so you get what
  you get.
 
  And publishing a structure via one RCU-protected pointer, updating it,
  then publishing it via another pointer seems to me to be asking for
  trouble anyway.  If you really want to do something like that and still
  see consistency across all the fields in the structure, please put a lock
  in the structure and use it to guard updates and accesses to those fields.
 
  And here is a patch documenting the restrictions for the current Linux
  kernel.  The rules change a bit due to rcu_dereference() acting a bit
  differently than atomic_load_explicit(&p, memory_order_consume).
 
  Thoughts?
 
 That might serve as informal documentation for linux kernel
 programmers about the bounds on the optimisations that you expect
 compilers to do for common-case RCU code - and I guess that's what you
 intend it to be for.   But I don't see how one can make it precise
 enough to serve as a language definition, so that compiler people
 could confidently say "yes, we respect that", which I guess is what
 you really need.  As a useful criterion, we should aim for something
 precise enough that in a verified-compiler context you can
 mathematically prove that the compiler will satisfy it  (even though
 that won't happen anytime soon for GCC), and that analysis tool
 authors can actually know what they're working with.   All this stuff
 about "you should avoid cancellation", and "avoid masking with just a
 small number of bits" is just too vague.

Understood, and yes, this is intended to document current compiler
behavior for the Linux kernel community.  It would not make sense to show
it to the C11 or C++11 communities, except perhaps as an informational
piece on current practice.

 The basic problem is that the compiler may be doing sophisticated
 reasoning with a bunch of non-local knowledge that it's deduced from
 the code, neither of which are well-understood, and here we have to
 identify some envelope, expressive enough for RCU idioms, in which
 that reasoning doesn't allow data/address dependencies to be removed
 (and hence the hardware guarantee about them will be maintained at the
 source level).
 
 The C11 syntactic notion of dependency, whatever its faults, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-28 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >  wrote:
> > >
> > > 3.  The comparison was against another RCU-protected pointer,
> > > where that other pointer was properly fetched using one
> > > of the RCU primitives.  Here it doesn't matter which pointer
> > > you use.  At least as long as the rcu_assign_pointer() for
> > > that other pointer happened after the last update to the
> > > pointed-to structure.
> > >
> > > I am a bit nervous about #3.  Any thoughts on it?
> > 
> > I think that it might be worth pointing out as an example, and saying
> > that code like
> > 
> >p = atomic_read(consume);
> >X;
> >q = atomic_read(consume);
> >Y;
> >if (p == q)
> > data = p->val;
> > 
> > then the access of "p->val" is constrained to be data-dependent on
> > *either* p or q, but you can't really tell which, since the compiler
> > can decide that the values are interchangeable.
> > 
> > I cannot for the life of me come up with a situation where this would
> > matter, though. If "X" contains a fence, then that fence will be a
> > stronger ordering than anything the consume through "p" would
> > guarantee anyway. And if "X" does *not* contain a fence, then the
> > atomic reads of p and q are unordered *anyway*, so then whether the
> > ordering to the access through "p" is through p or q is kind of
> > irrelevant. No?
> 
> I can make a contrived litmus test for it, but you are right, the only
> time you can see it happen is when X has no barriers, in which case
> you don't have any ordering anyway -- both the compiler and the CPU can
> reorder the loads into p and q, and the read from p->val can, as you say,
> come from either pointer.
> 
> For whatever it is worth, here is the litmus test:
> 
> T1:   p = kmalloc(...);
>   if (p == NULL)
>   deal_with_it();
>   p->a = 42;  /* Each field in its own cache line. */
>   p->b = 43;
>   p->c = 44;
> >   atomic_store_explicit(&gp1, p, memory_order_release);
>   p->b = 143;
>   p->c = 144;
> >   atomic_store_explicit(&gp2, p, memory_order_release);
> 
> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>   r1 = p->b;  /* Guaranteed to get 143. */
>   q = atomic_load_explicit(&gp1, memory_order_consume);
>   if (p == q) {
>   /* The compiler decides that q->c is same as p->c. */
>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
>   }
> 
> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> you get.
> 
> And publishing a structure via one RCU-protected pointer, updating it,
> then publishing it via another pointer seems to me to be asking for
> trouble anyway.  If you really want to do something like that and still
> see consistency across all the fields in the structure, please put a lock
> in the structure and use it to guard updates and accesses to those fields.

And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
> differently than atomic_load_explicit(&p, memory_order_consume).

Thoughts?

Thanx, Paul



documentation: Record rcu_dereference() value mishandling

Recent LKML discussions (see http://lwn.net/Articles/586838/ and
http://lwn.net/Articles/588300/ for the LWN writeups) brought out
some ways of misusing the return value from rcu_dereference() that
are not necessarily completely intuitive.  This commit therefore
documents what can and cannot safely be done with these values.

Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index fa57139f50bf..f773a264ae02 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ lockdep-splat.txt
- RCU Lockdep splats explained.
 NMI-RCU.txt
- Using RCU to Protect Dynamic NMI Handlers
+rcu_dereference.txt
+   - Proper care and feeding of return values from rcu_dereference()
 rcubarrier.txt
- RCU and Unloadable Modules
 rculist_nulls.txt
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 9d10d1db16a5..877947130ebe 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
http://www.openvms.compaq.com/wizard/wiz_2637.html
 
The rcu_dereference() primitive is also an excellent
-   documentation aid, letting the person reading the code
-   know exactly which pointers are protected by RCU.
+   documentation aid, letting the person reading the
+   code know 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-28 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
 On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
  On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:
  
   3.  The comparison was against another RCU-protected pointer,
   where that other pointer was properly fetched using one
   of the RCU primitives.  Here it doesn't matter which pointer
   you use.  At least as long as the rcu_assign_pointer() for
   that other pointer happened after the last update to the
   pointed-to structure.
  
   I am a bit nervous about #3.  Any thoughts on it?
  
  I think that it might be worth pointing out as an example, and saying
  that code like
  
 p = atomic_read(consume);
 X;
 q = atomic_read(consume);
 Y;
 if (p == q)
  data = p->val;
  
  then the access of p->val is constrained to be data-dependent on
  *either* p or q, but you can't really tell which, since the compiler
  can decide that the values are interchangeable.
  
  I cannot for the life of me come up with a situation where this would
  matter, though. If X contains a fence, then that fence will be a
  stronger ordering than anything the consume through p would
  guarantee anyway. And if X does *not* contain a fence, then the
  atomic reads of p and q are unordered *anyway*, so then whether the
  ordering to the access through p is through p or q is kind of
  irrelevant. No?
 
 I can make a contrived litmus test for it, but you are right, the only
 time you can see it happen is when X has no barriers, in which case
 you don't have any ordering anyway -- both the compiler and the CPU can
 reorder the loads into p and q, and the read from p->val can, as you say,
 come from either pointer.
 
 For whatever it is worth, here is the litmus test:
 
 T1:   p = kmalloc(...);
   if (p == NULL)
   deal_with_it();
   p->a = 42;  /* Each field in its own cache line. */
   p->b = 43;
   p->c = 44;
   atomic_store_explicit(gp1, p, memory_order_release);
   p->b = 143;
   p->c = 144;
   atomic_store_explicit(gp2, p, memory_order_release);
 
 T2:   p = atomic_load_explicit(gp2, memory_order_consume);
   r1 = p->b;  /* Guaranteed to get 143. */
   q = atomic_load_explicit(gp1, memory_order_consume);
   if (p == q) {
   /* The compiler decides that q->c is the same as p->c. */
   r2 = p->c; /* Could get 44 on a weakly ordered system. */
   }
 
 The loads from gp1 and gp2 are, as you say, unordered, so you get what
 you get.
 
 And publishing a structure via one RCU-protected pointer, updating it,
 then publishing it via another pointer seems to me to be asking for
 trouble anyway.  If you really want to do something like that and still
 see consistency across all the fields in the structure, please put a lock
 in the structure and use it to guard updates and accesses to those fields.
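
The lock-based approach Paul describes might look roughly like the
following sketch. It is illustrative only (the struct and function names
are invented, and POSIX mutexes stand in for kernel spinlocks so the
example is self-contained): because updates and reads of the fields are
both made under the lock, a reader can never observe the new b paired
with the old c, regardless of how many pointers publish the structure.

```c
#include <pthread.h>

/* Hypothetical structure; names are not from the thread. */
struct foo {
    pthread_mutex_t lock;   /* Guards b and c as a unit. */
    int b;
    int c;
};

/* Updater: change both fields under the lock, so no reader can see
 * the new b (143) paired with the old c (44). */
static void update_foo(struct foo *p)
{
    pthread_mutex_lock(&p->lock);
    p->b = 143;
    p->c = 144;
    pthread_mutex_unlock(&p->lock);
}

/* Reader: take a consistent snapshot of both fields. */
static void read_foo(struct foo *p, int *b, int *c)
{
    pthread_mutex_lock(&p->lock);
    *b = p->b;
    *c = p->c;
    pthread_mutex_unlock(&p->lock);
}
```

With this arrangement the question of which published pointer a reader
happened to load through no longer matters for field consistency.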

And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(p, memory_order_consume).

Thoughts?

Thanx, Paul



documentation: Record rcu_dereference() value mishandling

Recent LKML discussions (see http://lwn.net/Articles/586838/ and
http://lwn.net/Articles/588300/ for the LWN writeups) brought out
some ways of misusing the return value from rcu_dereference() that
are not necessarily completely intuitive.  This commit therefore
documents what can and cannot safely be done with these values.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index fa57139f50bf..f773a264ae02 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ lockdep-splat.txt
- RCU Lockdep splats explained.
 NMI-RCU.txt
- Using RCU to Protect Dynamic NMI Handlers
+rcu_dereference.txt
+   - Proper care and feeding of return values from rcu_dereference()
 rcubarrier.txt
- RCU and Unloadable Modules
 rculist_nulls.txt
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 9d10d1db16a5..877947130ebe 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -114,12 +114,16 @@ over a rather long period of time, but improvements are 
always welcome!
http://www.openvms.compaq.com/wizard/wiz_2637.html
 
The rcu_dereference() primitive is also an excellent
-   documentation aid, letting the person reading the code
-   know exactly which pointers are protected by RCU.
+   documentation aid, letting the person reading the
+   code know exactly which pointers are protected by RCU.
Please note 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > >  wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > > results in another chained pointer in that same pointer chain.
> > > > The results of addition and subtraction operations that cancel
> > > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > > is a pointer to char) are implementation defined.
> > > >
> > > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > > applied to a chained pointer and an integer for the purposes
> > > > of alignment and pointer translation results in another
> > > > chained pointer in that same pointer chain.  Other uses
> > > > of bitwise operators on chained pointers (for example,
> > > > "p|~0") are implementation defined.
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > dependencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>  wrote:
> >
> > 3.  The comparison was against another RCU-protected pointer,
> > where that other pointer was properly fetched using one
> > of the RCU primitives.  Here it doesn't matter which pointer
> > you use.  At least as long as the rcu_assign_pointer() for
> > that other pointer happened after the last update to the
> > pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>p = atomic_read(consume);
>X;
>q = atomic_read(consume);
>Y;
>if (p == q)
> data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.
> 
> I cannot for the life of me come up with a situation where this would
> matter, though. If "X" contains a fence, then that fence will be a
> stronger ordering than anything the consume through "p" would
> guarantee anyway. And if "X" does *not* contain a fence, then the
> atomic reads of p and q are unordered *anyway*, so then whether the
> ordering to the access through "p" is through p or q is kind of
> irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, here is the litmus test:

T1: p = kmalloc(...);
if (p == NULL)
deal_with_it();
p->a = 42;  /* Each field in its own cache line. */
p->b = 43;
p->c = 44;
atomic_store_explicit(gp1, p, memory_order_release);
p->b = 143;
p->c = 144;
atomic_store_explicit(gp2, p, memory_order_release);

T2: p = atomic_load_explicit(gp2, memory_order_consume);
r1 = p->b;  /* Guaranteed to get 143. */
q = atomic_load_explicit(gp1, memory_order_consume);
if (p == q) {
/* The compiler decides that q->c is the same as p->c. */
r2 = p->c; /* Could get 44 on a weakly ordered system. */
}

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 wrote:
>
> 3.  The comparison was against another RCU-protected pointer,
> where that other pointer was properly fetched using one
> of the RCU primitives.  Here it doesn't matter which pointer
> you use.  At least as long as the rcu_assign_pointer() for
> that other pointer happened after the last update to the
> pointed-to structure.
>
> I am a bit nervous about #3.  Any thoughts on it?

I think that it might be worth pointing out as an example, and saying
that code like

   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
data = p->val;

then the access of "p->val" is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.

I cannot for the life of me come up with a situation where this would
matter, though. If "X" contains a fence, then that fence will be a
stronger ordering than anything the consume through "p" would
guarantee anyway. And if "X" does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through "p" is through p or q is kind of
irrelevant. No?

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > >  wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > > results in another chained pointer in that same pointer chain.
> > > > The results of addition and subtraction operations that cancel
> > > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > > is a pointer to char) are implementation defined.
> > > >
> > > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > > applied to a chained pointer and an integer for the purposes
> > > > of alignment and pointer translation results in another
> > > > chained pointer in that same pointer chain.  Other uses
> > > > of bitwise operators on chained pointers (for example,
> > > > "p|~0") are implementation defined.
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > dependencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
> >
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> 
> I suspect it's hard to really strictly define, but at the same time I
> actually think that compiler writers (and users, for that matter) have
> little problem understanding the concept and intent.
> 
> I do think that listing operations might be useful to give good
> examples of what is a "necessary" value, and - perhaps more
> importantly - what can break the value from being "necessary".
> Especially the gotchas.
> 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Ok, I'm going to cut most of the verbiage since it's long and I'm not
> commenting on most of it.
> 
> But
> 
> > Based on these thoughts, we could specify the new mo_consume guarantees
> > roughly as follows:
> >
> > An evaluation E (in an execution) has a value dependency to an
> > atomic and mo_consume load L (in an execution) iff:
> > * L's type holds more than one value (ruling out constants
> > etc.),
> > * L is sequenced-before E,
> > * L's result is used by the abstract machine to compute E,
> > * E is value-dependency-preserving code (defined below), and
> > * at the time of execution of E, L can possibly have returned at
> > least two different values under the assumption that L itself
> > could have returned any value allowed by L's type.
> >
> > If a memory access A's targeted memory location has a value
> > dependency on a mo_consume load L, and an action X
> > inter-thread-happens-before L, then X happens-before A.
> 
> I think this mostly works.
> 
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == ) .. use p"), and you would forbid the
> "global value range optimization case" that Paul bright up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.

FWIW, I am looking through the kernel for instances of your first
"if (p == ) .. use p" limus test.  All the ones I have found
thus far are OK for one of the following reasons:

1.  The comparison was against NULL, so you don't get to dereference
the pointer anyway.  About 80% are in this category.

2.  The comparison was against another pointer, but there were no
dereferences afterwards.  Here is an example of what these
can look like:

list_for_each_entry_rcu(p, , next)
if (p == )
return; /* "p" goes out of scope. */

3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.

I am a bit nervous about #3.  Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4.  The pointed-to data is constant: (a) It was initialized at
boot time, (b) the update-side lock is held, (c) we are
running in a kthread and the data was initialized before the
kthread was created, (d) we are running in a module, and
the data was initialized during or before module-init time
for that module.  And many more besides, involving pretty
much every 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> >  wrote:
> > >
> > > Good points.  How about the following replacements?
> > >
> > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > results in another chained pointer in that same pointer chain.
> > > The results of addition and subtraction operations that cancel
> > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > is a pointer to char) are implementation defined.
> > >
> > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > applied to a chained pointer and an integer for the purposes
> > > of alignment and pointer translation results in another
> > > chained pointer in that same pointer chain.  Other uses
> > > of bitwise operators on chained pointers (for example,
> > > "p|~0") are implementation defined.
> > 
> > Quite frankly, I think all of this language that is about the actual
> > operations is irrelevant and wrong.
> > 
> > It's not going to help compiler writers, and it sure isn't going to
> > help users that read this.
> > 
> > Why not just talk about "value chains" and that any operations that
> > restrict the value range severely end up breaking the chain. There is
> > no point in listing the operations individually, because every single
> > operation *can* restrict things. Listing individual operations and
> > dependencies is just fundamentally wrong.
> 
> [...]
> 
> > The *only* thing that matters for all of them is whether they are
> > "value-preserving", or whether they drop so much information that the
> > compiler might decide to use a control dependency instead. That's true
> > for every single one of them.
> > 
> > Similarly, actual true control dependencies that limit the problem
> > space sufficiently that the actual pointer value no longer has
> > significant information in it (see the above example) are also things
> > that remove information to the point that only a control dependency
> > remains. Even when the value itself is not modified in any way at all.
> 
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".
> 
> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Thank you very much for putting this forward!  I must confess that I was
stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
is quite clearly way bogus.

One possible saving grace:  From discussions at the standards committee
meeting a few weeks ago, there is some chance that the committee will
be willing to do a rip-and-replace on the current memory_order_consume
wording, without provisions for backwards compatibility with the current
bogosity.

> What we'd like to capture is that a value originating from a mo_consume
> load is "necessary" for a computation (e.g., it "cannot" be replaced
> with value predictions and/or control dependencies); if that's the case
> in the program, we can reasonably assume that a compiler implementation
> will transform this into a data dependency, which will then lead to
> ordering guarantees by the HW.
> 
> However, we need to specify when a value is "necessary".  We could say
> that this is implementation-defined, and use a set of litmus tests
> (e.g., like those discussed in the thread) to roughly carve out what a
> programmer could expect.  This may even be practical for a project like
> the Linux kernel that follows strict project-internal rules and pays a
> lot of attention to what the particular implementations of compilers
> expected to compile the kernel are doing.  However, I think this
> approach would be too vague for the standard and for many other
> programs/projects.

I agree that a number of other projects would have more need for this than
might the kernel.  Please understand that this is in no way denigrating
the intelligence of other projects' members.  It is just that many of
them have only recently started seriously thinking about concurrency.
In contrast, the Linux kernel community has been doing concurrency since
the mid-1990s.  Projects with less experience with concurrency will
probably need more help, from the compiler and from elsewhere as 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
>
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".

I suspect it's hard to really strictly define, but at the same time I
actually think that compiler writers (and users, for that matter) have
little problem understanding the concept and intent.

I do think that listing operations might be useful to give good
examples of what is a "necessary" value, and - perhaps more
importantly - what can break the value from being "necessary".
Especially the gotchas.

> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Ok, I'm going to cut most of the verbiage since it's long and I'm not
commenting on most of it.

But

> Based on these thoughts, we could specify the new mo_consume guarantees
> roughly as follows:
>
> An evaluation E (in an execution) has a value dependency to an
> atomic and mo_consume load L (in an execution) iff:
> * L's type holds more than one value (ruling out constants
> etc.),
> * L is sequenced-before E,
> * L's result is used by the abstract machine to compute E,
> * E is value-dependency-preserving code (defined below), and
> * at the time of execution of E, L can possibly have returned at
> least two different values under the assumption that L itself
> could have returned any value allowed by L's type.
>
> If a memory access A's targeted memory location has a value
> dependency on a mo_consume load L, and an action X
> inter-thread-happens-before L, then X happens-before A.

I think this mostly works.

> Regarding the latter, we make a fresh start at each mo_consume load (ie,
> we assume we know nothing -- L could have returned any possible value);
> I believe this is easier to reason about than other scopes like function
> granularities (what happens on inlining?), or translation units.  It
> should also be simple to implement for compilers, and would hopefully
> not constrain optimization too much.
>
> [...]
>
> Paul's litmus test would work, because we guarantee to the programmer
> that it can assume that the mo_consume load would return any value
> allowed by the type; effectively, this forbids the compiler analysis
> Paul thought about:

So realistically, since with the new wording we can ignore the silly
cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
cases ("if (p == ) .. use p"), and you would forbid the
"global value range optimization case" that Paul bright up, what
remains would seem to be just really subtle compiler transformations
of data dependencies to control dependencies.

And the only such thing I can think of is basically compiler-initiated
value-prediction, presumably directed by PGO (since now if the value
prediction is in the source code, it's considered to break the value
chain).

The good thing is that afaik, value-prediction is largely not used in
real life. There are lots of papers on it, but I don't think
anybody actually does it (although I can easily see some
specint-specific optimization pattern that is built up around it).

And even value prediction is actually fine, as long as the compiler
can see the memory *source* of the value prediction (and it isn't a
mo_consume). So it really ends up limiting your value prediction in
very simple ways: you cannot do it to function arguments if they are
registers. But you can still do value prediction on values you loaded
from memory, if you can actually *see* that memory op.

Of course, on more strongly ordered CPU's, even that "register
argument" limitation goes away.

So I agree that there is basically no real optimization constraint.
Value-prediction is of dubious value to begin with, and the actual
constraint on its use if some compiler writer really wants to is not
onerous.

> What I have in mind is roughly the following (totally made-up syntax --
> suggestions for how to do this properly are very welcome):
> * Have a type modifier (eg, like restrict), that specifies that
> operations on data of this type are preserving value dependencies:

So I'm not violently opposed, but I think the upsides are not great.
Note that my earlier suggestion to use "restrict" wasn't because I
believed the annotation itself would be visible, but basically just as
a legalistic promise to the compiler that *if* it found an alias, then
it didn't need to worry about 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Torvald Riegel
On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
>  wrote:
> >
> > Good points.  How about the following replacements?
> >
> > 3.  Adding or subtracting an integer to/from a chained pointer
> > results in another chained pointer in that same pointer chain.
> > The results of addition and subtraction operations that cancel
> > the chained pointer's value (for example, "p-(long)p" where "p"
> > is a pointer to char) are implementation defined.
> >
> > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > applied to a chained pointer and an integer for the purposes
> > of alignment and pointer translation results in another
> > chained pointer in that same pointer chain.  Other uses
> > of bitwise operators on chained pointers (for example,
> > "p|~0") are implementation defined.
> 
> Quite frankly, I think all of this language that is about the actual
> operations is irrelevant and wrong.
> 
> It's not going to help compiler writers, and it sure isn't going to
> help users that read this.
> 
> Why not just talk about "value chains" and that any operations that
> restrict the value range severely end up breaking the chain. There is
> no point in listing the operations individually, because every single
> operation *can* restrict things. Listing individual operations and
> dependencies is just fundamentally wrong.

[...]

> The *only* thing that matters for all of them is whether they are
> "value-preserving", or whether they drop so much information that the
> compiler might decide to use a control dependency instead. That's true
> for every single one of them.
> 
> Similarly, actual true control dependencies that limit the problem
> space sufficiently that the actual pointer value no longer has
> significant information in it (see the above example) are also things
> that remove information to the point that only a control dependency
> remains. Even when the value itself is not modified in any way at all.

I agree that just considering syntactic properties of the program seems
to be insufficient.  Making it instead depend on whether there is a
"semantic" dependency due to a value being "necessary" to compute a
result seems better.  However, whether a value is "necessary" might not
be obvious, and I understand Paul's argument that he does not want to
have to reason about all potential compiler optimizations.  Thus, I
believe we need to specify when a value is "necessary".

I have a suggestion for a somewhat different formulation of the feature
that you seem to have in mind, which I'll discuss below.  Excuse the
verbosity of the following, but I'd rather like to avoid
misunderstandings than save a few words.


What we'd like to capture is that a value originating from a mo_consume
load is "necessary" for a computation (e.g., it "cannot" be replaced
with value predictions and/or control dependencies); if that's the case
in the program, we can reasonably assume that a compiler implementation
will transform this into a data dependency, which will then lead to
ordering guarantees by the HW.

However, we need to specify when a value is "necessary".  We could say
that this is implementation-defined, and use a set of litmus tests
(e.g., like those discussed in the thread) to roughly carve out what a
programmer could expect.  This may even be practical for a project like
the Linux kernel that follows strict project-internal rules and pays a
lot of attention to what the particular implementations of compilers
expected to compile the kernel are doing.  However, I think this
approach would be too vague for the standard and for many other
programs/projects.


One way to understand "necessary" would be to say that if a mo_consume
load can result in more than V different values, then the actual value
is "unknown", and thus "necessary" to compute anything based on it.
(But this is flawed, as discussed below.)

However, how big should V be?  If it's larger than 1, atomic bool cannot
be used with mo_consume, which seems weird.
If V is 1, then Linus' litmus tests work (but Paul's doesn't; see
below), but the compiler must not try to predict more than one value.
This holds for any choice of V, so there always is an *additional*
constraint on code generation for operations that are meant to take part
in such "value dependencies".  The bigger V might be, the less likely it
should be for this to actually constrain a particular compiler's
optimizations (e.g., while it might be meaningful to use value
prediction for two or three values, it's probably not for 1000s).
Nonetheless, if we don't want to predict the future, we need to specify
V.  Given that we always have some constraint for code generation
anyway, and given that V > 1 might be an arbitrary-looking constraint
and disallows use on atomic bool, I believe V should be 1.

Furthermore, there is a problem in saying "a load can result in more
than one value" because in a deterministic 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel trie...@redhat.com wrote:

 I agree that just considering syntactic properties of the program seems
 to be insufficient.  Making it instead depend on whether there is a
 semantic dependency due to a value being necessary to compute a
 result seems better.  However, whether a value is necessary might not
 be obvious, and I understand Paul's argument that he does not want to
 have to reason about all potential compiler optimizations.  Thus, I
 believe we need to specify when a value is necessary.

I suspect it's hard to really strictly define, but at the same time I
actually think that compiler writers (and users, for that matter) have
little problem understanding the concept and intent.

I do think that listing operations might be useful to give good
examples of what is a necessary value, and - perhaps more
importantly - what can break the value from being necessary.
Especially the gotchas.

 I have a suggestion for a somewhat different formulation of the feature
 that you seem to have in mind, which I'll discuss below.  Excuse the
 verbosity of the following, but I'd rather like to avoid
 misunderstandings than save a few words.

Ok, I'm going to cut most of the verbiage since it's long and I'm not
commenting on most of it.

But

 Based on these thoughts, we could specify the new mo_consume guarantees
 roughly as follows:

 An evaluation E (in an execution) has a value dependency to an
 atomic and mo_consume load L (in an execution) iff:
 * L's type holds more than one value (ruling out constants
 etc.),
 * L is sequenced-before E,
 * L's result is used by the abstract machine to compute E,
 * E is value-dependency-preserving code (defined below), and
 * at the time of execution of E, L can possibly have returned at
 least two different values under the assumption that L itself
 could have returned any value allowed by L's type.

 If a memory access A's targeted memory location has a value
 dependency on a mo_consume load L, and an action X
 inter-thread-happens-before L, then X happens-before A.

I think this mostly works.

 Regarding the latter, we make a fresh start at each mo_consume load (ie,
 we assume we know nothing -- L could have returned any possible value);
 I believe this is easier to reason about than other scopes like function
 granularities (what happens on inlining?), or translation units.  It
 should also be simple to implement for compilers, and would hopefully
 not constrain optimization too much.

 [...]

 Paul's litmus test would work, because we guarantee to the programmer
 that it can assume that the mo_consume load would return any value
 allowed by the type; effectively, this forbids the compiler analysis
 Paul thought about:

So realistically, since with the new wording we can ignore the silly
cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
cases ("if (p == &variable) .. use p"), and you would forbid the
"global value range optimization case" that Paul brought up, what
remains would seem to be just really subtle compiler transformations
of data dependencies to control dependencies.

And the only such thing I can think of is basically compiler-initiated
value-prediction, presumably directed by PGO (since now if the value
prediction is in the source code, it's considered to break the value
chain).

The good thing is that, afaik, value-prediction is largely not used in
real life. There are lots of papers on it, but I don't think
anybody actually does it (although I can easily see some
specint-specific optimization pattern that is built up around it).

And even value prediction is actually fine, as long as the compiler
can see the memory *source* of the value prediction (and it isn't a
mo_consume). So it really ends up limiting your value prediction in
very simple ways: you cannot do it to function arguments if they are
registers. But you can still do value prediction on values you loaded
from memory, if you can actually *see* that memory op.

Of course, on more strongly ordered CPU's, even that "register
argument" limitation goes away.

So I agree that there is basically no real optimization constraint.
Value-prediction is of dubious value to begin with, and the actual
constraint on its use if some compiler writer really wants to is not
onerous.

 What I have in mind is roughly the following (totally made-up syntax --
 suggestions for how to do this properly are very welcome):
 * Have a type modifier (eg, like restrict), that specifies that
 operations on data of this type are preserving value dependencies:

So I'm not violently opposed, but I think the upsides are not great.
Note that my earlier suggestion to use restrict wasn't because I
believed the annotation itself would be visible, but basically just as
a legalistic promise to the compiler that *if* it found an alias, then
it didn't need to worry about 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
 On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
  On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:
  
   Good points.  How about the following replacements?
  
   3.  Adding or subtracting an integer to/from a chained pointer
   results in another chained pointer in that same pointer chain.
   The results of addition and subtraction operations that cancel
   the chained pointer's value (for example, "p-(long)p" where "p"
   is a pointer to char) are implementation defined.
  
   4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
   applied to a chained pointer and an integer for the purposes
   of alignment and pointer translation results in another
   chained pointer in that same pointer chain.  Other uses
   of bitwise operators on chained pointers (for example,
   "p|~0") are implementation defined.
  
  Quite frankly, I think all of this language that is about the actual
  operations is irrelevant and wrong.
  
  It's not going to help compiler writers, and it sure isn't going to
  help users that read this.
  
  Why not just talk about value chains and that any operations that
  restrict the value range severely end up breaking the chain. There is
  no point in listing the operations individually, because every single
  operation *can* restrict things. Listing individual operations and
  dependencies is just fundamentally wrong.
 
 [...]
 
  The *only* thing that matters for all of them is whether they are
  value-preserving, or whether they drop so much information that the
  compiler might decide to use a control dependency instead. That's true
  for every single one of them.
  
  Similarly, actual true control dependencies that limit the problem
  space sufficiently that the actual pointer value no longer has
  significant information in it (see the above example) are also things
  that remove information to the point that only a control dependency
  remains. Even when the value itself is not modified in any way at all.
 
 I agree that just considering syntactic properties of the program seems
 to be insufficient.  Making it instead depend on whether there is a
 semantic dependency due to a value being necessary to compute a
 result seems better.  However, whether a value is necessary might not
 be obvious, and I understand Paul's argument that he does not want to
 have to reason about all potential compiler optimizations.  Thus, I
 believe we need to specify when a value is necessary.
 
 I have a suggestion for a somewhat different formulation of the feature
 that you seem to have in mind, which I'll discuss below.  Excuse the
 verbosity of the following, but I'd rather like to avoid
 misunderstandings than save a few words.

Thank you very much for putting this forward!  I must confess that I was
stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
is quite clearly way bogus.

One possible saving grace:  From discussions at the standards committee
meeting a few weeks ago, there is some chance that the committee will
be willing to do a rip-and-replace on the current memory_order_consume
wording, without provisions for backwards compatibility with the current
bogosity.

 What we'd like to capture is that a value originating from a mo_consume
 load is necessary for a computation (e.g., it cannot be replaced
 with value predictions and/or control dependencies); if that's the case
 in the program, we can reasonably assume that a compiler implementation
 will transform this into a data dependency, which will then lead to
 ordering guarantees by the HW.
 
 However, we need to specify when a value is necessary.  We could say
 that this is implementation-defined, and use a set of litmus tests
 (e.g., like those discussed in the thread) to roughly carve out what a
 programmer could expect.  This may even be practical for a project like
 the Linux kernel that follows strict project-internal rules and pays a
 lot of attention to what the particular implementations of compilers
 expected to compile the kernel are doing.  However, I think this
 approach would be too vague for the standard and for many other
 programs/projects.

I agree that a number of other projects would have more need for this than
might the kernel.  Please understand that this is in no way denigrating
the intelligence of other projects' members.  It is just that many of
them have only recently started seriously thinking about concurrency.
In contrast, the Linux kernel community has been doing concurrency since
the mid-1990s.  Projects with less experience with concurrency will
probably need more help, from the compiler and from elsewhere as well.

Your proposal looks quite promising at first glance.  But rather than
try and comment on it immediately, I am going to take a number 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel trie...@redhat.com wrote:
 
  I agree that just considering syntactic properties of the program seems
  to be insufficient.  Making it instead depend on whether there is a
  semantic dependency due to a value being necessary to compute a
  result seems better.  However, whether a value is necessary might not
  be obvious, and I understand Paul's argument that he does not want to
  have to reason about all potential compiler optimizations.  Thus, I
  believe we need to specify when a value is necessary.
 
 I suspect it's hard to really strictly define, but at the same time I
 actually think that compiler writers (and users, for that matter) have
 little problem understanding the concept and intent.
 
 I do think that listing operations might be useful to give good
 examples of what is a necessary value, and - perhaps more
 importantly - what can break the value from being necessary.
 Especially the gotchas.
 
  I have a suggestion for a somewhat different formulation of the feature
  that you seem to have in mind, which I'll discuss below.  Excuse the
  verbosity of the following, but I'd rather like to avoid
  misunderstandings than save a few words.
 
 Ok, I'm going to cut most of the verbiage since it's long and I'm not
 commenting on most of it.
 
 But
 
  Based on these thoughts, we could specify the new mo_consume guarantees
  roughly as follows:
 
  An evaluation E (in an execution) has a value dependency to an
  atomic and mo_consume load L (in an execution) iff:
  * L's type holds more than one value (ruling out constants
  etc.),
  * L is sequenced-before E,
  * L's result is used by the abstract machine to compute E,
  * E is value-dependency-preserving code (defined below), and
  * at the time of execution of E, L can possibly have returned at
  least two different values under the assumption that L itself
  could have returned any value allowed by L's type.
 
  If a memory access A's targeted memory location has a value
  dependency on a mo_consume load L, and an action X
  inter-thread-happens-before L, then X happens-before A.
 
 I think this mostly works.
 
  Regarding the latter, we make a fresh start at each mo_consume load (ie,
  we assume we know nothing -- L could have returned any possible value);
  I believe this is easier to reason about than other scopes like function
  granularities (what happens on inlining?), or translation units.  It
  should also be simple to implement for compilers, and would hopefully
  not constrain optimization too much.
 
  [...]
 
  Paul's litmus test would work, because we guarantee to the programmer
  that it can assume that the mo_consume load would return any value
  allowed by the type; effectively, this forbids the compiler analysis
  Paul thought about:
 
 So realistically, since with the new wording we can ignore the silly
 cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
 cases ("if (p == &variable) .. use p"), and you would forbid the
 "global value range optimization case" that Paul brought up, what
 remains would seem to be just really subtle compiler transformations
 of data dependencies to control dependencies.

FWIW, I am looking through the kernel for instances of your first
"if (p == &variable) .. use p" litmus test.  All the ones I have found
thus far are OK for one of the following reasons:

1.  The comparison was against NULL, so you don't get to dereference
the pointer anyway.  About 80% are in this category.

2.  The comparison was against another pointer, but there were no
dereferences afterwards.  Here is an example of what these
can look like:

list_for_each_entry_rcu(p, head, next)
if (p == variable)
return; /* p goes out of scope. */

3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.

I am a bit nervous about #3.  Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4.  The pointed-to data is constant: (a) It was initialized at
boot time, (b) the update-side lock is held, (c) we are
running in a kthread and the data was initialized before the
kthread was created, (d) we are running in a module, and
the data was initialized during or before module-init time
for that module.  And many more besides, involving pretty
much every kernel primitive that makes something run later.

5.  All subsequent dereferences 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 3.  The comparison was against another RCU-protected pointer,
 where that other pointer was properly fetched using one
 of the RCU primitives.  Here it doesn't matter which pointer
 you use.  At least as long as the rcu_assign_pointer() for
 that other pointer happened after the last update to the
 pointed-to structure.

 I am a bit nervous about #3.  Any thoughts on it?

I think that it might be worth pointing out as an example, and saying
that code like

   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
        data = p->val;

then the access of p->val is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.

I cannot for the life of me come up with a situation where this would
matter, though. If X contains a fence, then that fence will be a
stronger ordering than anything the consume through p would
guarantee anyway. And if X does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through p is through p or q is kind of
irrelevant. No?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  3.  The comparison was against another RCU-protected pointer,
  where that other pointer was properly fetched using one
  of the RCU primitives.  Here it doesn't matter which pointer
  you use.  At least as long as the rcu_assign_pointer() for
  that other pointer happened after the last update to the
  pointed-to structure.
 
  I am a bit nervous about #3.  Any thoughts on it?
 
 I think that it might be worth pointing out as an example, and saying
 that code like
 
p = atomic_read(consume);
X;
q = atomic_read(consume);
Y;
 if (p == q)
      data = p->val;
 
 then the access of p->val is constrained to be data-dependent on
 *either* p or q, but you can't really tell which, since the compiler
 can decide that the values are interchangeable.
 
 I cannot for the life of me come up with a situation where this would
 matter, though. If X contains a fence, then that fence will be a
 stronger ordering than anything the consume through p would
 guarantee anyway. And if X does *not* contain a fence, then the
 atomic reads of p and q are unordered *anyway*, so then whether the
 ordering to the access through p is through p or q is kind of
 irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, here is the litmus test:

T1: p = kmalloc(...);
    if (p == NULL)
        deal_with_it();
    p->a = 42;  /* Each field in its own cache line. */
    p->b = 43;
    p->c = 44;
    atomic_store_explicit(&gp1, p, memory_order_release);
    p->b = 143;
    p->c = 144;
    atomic_store_explicit(&gp2, p, memory_order_release);

T2: p = atomic_load_explicit(&gp2, memory_order_consume);
    r1 = p->b;  /* Guaranteed to get 143. */
    q = atomic_load_explicit(&gp1, memory_order_consume);
    if (p == q) {
        /* The compiler decides that q->c is the same as p->c. */
        r2 = p->c; /* Could get 44 on a weakly ordered system. */
    }

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
 On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
  
  On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
   On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
   
Good points.  How about the following replacements?
   
3.  Adding or subtracting an integer to/from a chained pointer
results in another chained pointer in that same pointer chain.
The results of addition and subtraction operations that cancel
the chained pointer's value (for example, p-(long)p where p
is a pointer to char) are implementation defined.
   
4.  Bitwise operators (&, |, ^, and I suppose also ~)
applied to a chained pointer and an integer for the purposes
of alignment and pointer translation results in another
chained pointer in that same pointer chain.  Other uses
of bitwise operators on chained pointers (for example,
p|~0) are implementation defined.
   
   Quite frankly, I think all of this language that is about the actual
   operations is irrelevant and wrong.
   
   It's not going to help compiler writers, and it sure isn't going to
   help users that read this.
   
   Why not just talk about value chains and that any operations that
   restrict the value range severely end up breaking the chain. There is
   no point in listing the operations individually, because every single
   operation *can* restrict things. Listing individual operations and
   dependencies is just fundamentally wrong.
  
  [...]
  
   The *only* thing that matters for all of them is whether they are
   value-preserving, or whether they drop so much information that the
   compiler might decide to use a control dependency instead. That's true
   for every single one of them.
   
   Similarly, actual true control dependencies that limit the problem
   space sufficiently that the actual pointer value no longer has
   significant information in it (see the above example) are also things
   that remove information to the point that only a control dependency
   remains. Even when the value itself is not modified in any way at all.
  
  I agree that just considering syntactic properties of the program seems
  to be insufficient.  Making it instead depend on whether there is a
  semantic dependency due to a value being necessary to compute a
  result seems better.  However, whether a value is necessary might not
  be obvious, and I understand Paul's argument that he does not want to
  have to reason about all potential compiler optimizations.  Thus, I
  believe we need to specify when a value is necessary.
  
  I have a suggestion for a somewhat different formulation of the feature
  that you seem to have in mind, which I'll discuss below.  Excuse the
  verbosity of the following, but I'd rather like to avoid
  misunderstandings than save a few words.
 
 Thank you very much for putting this forward!  I must confess that I was
 stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
 is quite clearly way bogus.
 
 One possible saving grace:  From discussions at the standards committee
 meeting a few weeks ago, there is a some chance that the committee will
 be willing to do a rip-and-replace on the current memory_order_consume
 wording, without provisions for backwards compatibility with the current
 bogosity.
 
  What we'd like to capture is that a value originating from a mo_consume
  load is necessary for a computation (e.g., it cannot be replaced
  with value predictions and/or control dependencies); if that's the case
  in the program, we can reasonably assume that a compiler implementation
  will transform this into a data dependency, which will then lead to
  ordering guarantees by the HW.
  
  However, we need to specify when a value is necessary.  We could say
  that this is implementation-defined, and use a set of litmus tests
  (e.g., like those discussed in the thread) to roughly carve out what a
  programmer could expect.  This may even be practical for a project like
  the Linux kernel that follows strict project-internal rules and pays a
  lot of attention to what the particular implementations of compilers
  expected to compile the kernel are doing.  However, I think this
  approach would be too vague for the standard and for many other
  programs/projects.
 
 I agree that a number of other projects would have more need for this than
 might the kernel.  Please understand that this is in no way denigrating
 the intelligence of other projects' members.  It is just that many of
 them have only recently started seriously thinking about concurrency.
 In contrast, the Linux kernel community has been doing concurrency since
 the mid-1990s.  Projects with less experience with concurrency will
 probably need more help, from the compiler 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Wed, 2014-02-26 at 18:43 +, Joseph S. Myers wrote:
> On Wed, 26 Feb 2014, Torvald Riegel wrote:
> 
> > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
> > > On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> > > 
> > > > This needs to be as follows:
> > > > 
> > > > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > > > {
> > > > return i - i;
> > > > }
> > > > 
> > > > Otherwise dependencies won't get carried through it.
> > > 
> > > C11 doesn't have attributes at all (and no specification regarding calls 
> > > and dependencies that I can see).  And the way I read the C++11 
> > > specification of carries_dependency is that specifying carries_dependency 
> > > is purely about increasing optimization of the caller: that if it isn't 
> > > specified, then the caller doesn't know what dependencies might be 
> > > carried.  "Note: The carries_dependency attribute does not change the 
> > > meaning of the program, but may result in generation of more efficient 
> > > code. - end note".
> > 
> > I think that this last sentence can be kind of misleading, especially
> > when looking at it from an implementation point of view.  How
> > dependencies are handled (ie, preserving the syntactic dependencies vs.
> > emitting barriers) must be part of the ABI, or things like
> > [[carries_dependency]] won't work as expected (or lead to inefficient
> > code).  Thus, in practice, all compiler vendors on a platform would have
> > to agree to a particular handling, which might end up in selecting the
> > easy-but-conservative implementation option (ie, always emitting
> > mo_acquire when the source uses mo_consume).
> 
> Regardless of the ABI, my point is that if a program is valid, it is also 
> valid when all uses of [[carries_dependency]] are removed.  If a function 
> doesn't use [[carries_dependency]], that means "dependencies may or may 
> not be carried through this function".  If a function uses 
> [[carries_dependency]], that means that certain dependencies *are* carried 
> through the function (and the ABI should then specify what this means the 
> caller can rely on, in terms of the architecture's memory model).  (This 
> may or may not be useful, but it's how I understand C++11.)

I agree.  What I tried to point out is that it's not the case that an
*implementation* can just ignore [[carries_dependency]].  So from an
implementation perspective, the attribute does have semantics.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Joseph S. Myers
On Wed, 26 Feb 2014, Torvald Riegel wrote:

> On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
> > On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> > 
> > > This needs to be as follows:
> > > 
> > > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > > {
> > >   return i - i;
> > > }
> > > 
> > > Otherwise dependencies won't get carried through it.
> > 
> > C11 doesn't have attributes at all (and no specification regarding calls 
> > and dependencies that I can see).  And the way I read the C++11 
> > specification of carries_dependency is that specifying carries_dependency 
> > is purely about increasing optimization of the caller: that if it isn't 
> > specified, then the caller doesn't know what dependencies might be 
> > carried.  "Note: The carries_dependency attribute does not change the 
> > meaning of the program, but may result in generation of more efficient 
> > code. - end note".
> 
> I think that this last sentence can be kind of misleading, especially
> when looking at it from an implementation point of view.  How
> dependencies are handled (ie, preserving the syntactic dependencies vs.
> emitting barriers) must be part of the ABI, or things like
> [[carries_dependency]] won't work as expected (or lead to inefficient
> code).  Thus, in practice, all compiler vendors on a platform would have
> to agree to a particular handling, which might end up in selecting the
> easy-but-conservative implementation option (ie, always emitting
> mo_acquire when the source uses mo_consume).

Regardless of the ABI, my point is that if a program is valid, it is also 
valid when all uses of [[carries_dependency]] are removed.  If a function 
doesn't use [[carries_dependency]], that means "dependencies may or may 
not be carried through this function".  If a function uses 
[[carries_dependency]], that means that certain dependencies *are* carried 
through the function (and the ABI should then specify what this means the 
caller can rely on, in terms of the architecture's memory model).  (This 
may or may not be useful, but it's how I understand C++11.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Paul E. McKenney
On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote:
> 
> On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> > > Hi,
> > > 
> > > On Thu, 20 Feb 2014, Linus Torvalds wrote:
> > > 
> > > > But I'm pretty sure that any compiler guy must *hate* that current odd
> > > > dependency-generation part, and if I was a gcc person, seeing that
> > > > bugzilla entry Torvald pointed at, I would personally want to
> > > > dismember somebody with a rusty spoon..
> > > 
> > > Yes.  Defining dependency chains in the way the standard currently seems 
> > > to do must come from people not writing compilers.  There's simply no 
> > > sensible way to implement it without being really conservative, because 
> > > the depchains can contain arbitrary constructs including stores, 
> > > loads and function calls but must still be observed.  
> > > 
> > > And with conservative I mean "everything is a source of a dependency, and 
> > > hence can't be removed, reordered or otherwise fiddled with", and that 
> > > includes code sequences where no atomic objects are anywhere in sight [1].
> > > In the light of that the only realistic way (meaning to not have to 
> > > disable optimization everywhere) to implement consume as currently 
> > > specified is to map it to acquire.  At which point it becomes pointless.
> > 
> > No, only memory_order_consume loads and [[carries_dependency]]
> > function arguments are sources of dependency chains.
> 
> However, that is, given how the standard specifies things, just one of
> the possible ways for how an implementation can handle this.  Treating
> [[carries_dependency]] as a "necessary" annotation to make exploiting
> mo_consume work in practice is possible, but it's not required by the
> standard.
> 
> Also, dependencies are specified to flow through loads and stores
> (restricted to scalar objects and bitfields), so any load that might
> load from a dependency-carrying store can also be a source (and that
> doesn't seem to be restricted by [[carries_dependency]]).

OK, this last is clearly totally unacceptable.  :-/

Leaving aside the option of dropping the whole thing for the moment,
the only thing that suggests itself is having all dependencies die at
a specific point in the code, corresponding to the rcu_read_unlock().
But as far as I can see, that absolutely requires "necessary" parameter
and return marking in order to correctly handle nested RCU read-side
critical sections in different functions.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote:
> On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Mon, 24 Feb 2014, Linus Torvalds wrote:
> > 
> > > > To me that reads like
> > > >
> > > >   int i;
> > > >   int *q = &i;
> > > >   int **p = &q;
> > > >
> > > >   atomic_XXX (p, CONSUME);
> > > >
> > > > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > > > want to say that it orders against aliased storage - but then go further
> > > > and include "indirectly through a chain of pointers"?!  Thus an
> > > > atomic read of a int * orders against any 'int' memory operation but
> > > > not against 'float' memory operations?
> > > 
> > > No, it's not about type at all, and the "chain of pointers" can be
> > > much more complex than that, since the "int *" can point to within an
> > > object that contains other things than just that "int" (the "int" can
> > > be part of a structure that then has pointers to other structures
> > > etc).
> > 
> > So, let me try to poke holes into your definition or increase my 
> > understanding :) .  You said "chain of pointers"(dereferences I assume), 
> > e.g. if p is result of consume load, then access to 
> > p->here->there->next->prev->stuff is supposed to be ordered with that load 
> > (or only when that last load/store itself is also an atomic load or 
> > store?).
> > 
> > So, what happens if the pointer deref chain is partly hidden in some 
> > functions:
> > 
> > A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> > B * p = atomic_XXX (, consume);
> > adjustptr(p)->prev->stuff = bla;
> > 
> > As far as I understood you, this whole ptrderef chain business would be 
> > only an optimization opportunity, right?  So if the compiler can't be sure 
> > how p is actually used (as in my function-using case, assume adjustptr is 
> > defined in another unit), then the consume load would simply be 
> > transformed into an acquire (or whatever, with some barrier I mean)?  Only 
> > _if_ the compiler sees all obvious uses of p (indirectly through pointer 
> > derefs) can it, yeah, do what with the consume load?
> 
> Good point, I left that out of my list.  Adding it:
> 
> 13.   By default, pointer chains do not propagate into or out of functions.
>   In implementations having attributes, a [[carries_dependency]]
>   may be used to mark a function argument or return as passing
>   a pointer chain into or out of that function.
> 
>   If a function does not contain memory_order_consume loads and
>   also does not contain [[carries_dependency]] attributes, then
>   that function may be compiled using any desired dependency-breaking
>   optimizations.
> 
>   The ordering effects are implementation defined when a given
>   pointer chain passes into or out of a function through a parameter
>   or return not marked with a [[carries_dependency]] attributed.
> 
> Note that this last paragraph differs from the current standard, which
> would require ordering regardless.

I would prefer if we could get rid off [[carries_dependency]] as well;
currently, it's a hint whose effectiveness really depends on how the
particular implementation handles this attribute.  If we still need
something like it in the future, it would be good if it had a clearer
use and performance effects.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Mon, 2014-02-24 at 09:38 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz  wrote:
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers"(dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that load
> > (or only when that last load/store itself is also an atomic load or
> > store?).
> 
> It's supposed to be ordered wrt the first load (the consuming one), yes.
> 
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
> 
> No problem.
> 
> The thing is, the ordering is actually handled by the CPU in all
> relevant cases.  So the compiler doesn't actually need to *do*
> anything. All this legalistic stuff is just to describe the semantics
> and the guarantees.
> 
> The problem is two cases:
> 
>  (a) alpha (which doesn't really order any accesses at all, not even
> dependent loads), but for a compiler alpha is actually trivial: just
> add a "rmb" instruction after the load, and you can't really do
> anything else (there's a few optimizations you can do wrt the rmb, but
> they are very specific and simple).
> 
> So (a) is a "problem", but the solution is actually really simple, and
> gives very *strong* guarantees: on alpha, a "consume" ends up being
> basically the same as a read barrier after the load, with only very
> minimal room for optimization.
> 
>  (b) ARM and powerpc and similar architectures, that guarantee the
> data dependency as long as it is an *actual* data dependency, and
> never becomes a control dependency.
> 
> On ARM and powerpc, control dependencies do *not* order accesses (the
> reasons boil down to essentially: branch prediction breaks the
> dependency, and instructions that come after the branch can be happily
> executed before the branch). But it's almost impossible to describe
> that in the standard, since compilers can (and very much do) turn a
> control dependency into a data dependency and vice versa.
> 
> So the current standard tries to describe that "control vs data"
> dependency, and tries to limit it to a data dependency. It fails. It
> fails for multiple reasons - it doesn't allow for trivial
> optimizations that just remove the data dependency, and it also
> doesn't allow for various trivial cases where the compiler *does* turn
> the data dependency into a control dependency.
> 
> So I really really think that the current C standard language is
> broken. Unfixably broken.
> 
> I'm trying to change the "syntactic data dependency" that the current
> standard uses into something that is clearer and correct.
> 
> The "chain of pointers" thing is still obviously a data dependency,
> but by limiting it to pointers, it simplifies the language, clarifies
> the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
> syntactic dependency on "p", but does *not* involve in any way
> following the pointer) and makes it basically impossible for the
> compiler to break the dependency without doing value prediction, and
> since value prediction has to be disallowed anyway, that's a feature,
> not a bug.

AFAIU, Michael is wondering about how we can separate non-synchronizing
code (ie, in this case, not taking part in any "chain of pointers" used
with mo_consume loads) from code that does.  If we cannot, then we
prevent value prediction *everywhere*, unless the compiler can prove
that the code is never part of such a chain (which is hard due to alias
analysis being hard, etc.).
(We can probably argue to which extent value prediction is necessary for
generation of efficient code, but it obviously does work in
non-synchronizing code (or even with acquire barriers with some care) --
so forbidding it entirely might be bad.)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
> On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> 
> > This needs to be as follows:
> > 
> > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > {
> > return i - i;
> > }
> > 
> > Otherwise dependencies won't get carried through it.
> 
> C11 doesn't have attributes at all (and no specification regarding calls 
> and dependencies that I can see).  And the way I read the C++11 
> specification of carries_dependency is that specifying carries_dependency 
> is purely about increasing optimization of the caller: that if it isn't 
> specified, then the caller doesn't know what dependencies might be 
> carried.  "Note: The carries_dependency attribute does not change the 
> meaning of the program, but may result in generation of more efficient 
> code. - end note".

I think that this last sentence can be kind of misleading, especially
when looking at it from an implementation point of view.  How
dependencies are handled (ie, preserving the syntactic dependencies vs.
emitting barriers) must be part of the ABI, or things like
[[carries_dependency]] won't work as expected (or lead to inefficient
code).  Thus, in practice, all compiler vendors on a platform would have
to agree to a particular handling, which might end up in selecting the
easy-but-conservative implementation option (ie, always emitting
mo_acquire when the source uses mo_consume).



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
> On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> > Hi,
> > 
> > On Thu, 20 Feb 2014, Linus Torvalds wrote:
> > 
> > > But I'm pretty sure that any compiler guy must *hate* that current odd
> > > dependency-generation part, and if I was a gcc person, seeing that
> > > bugzilla entry Torvald pointed at, I would personally want to
> > > dismember somebody with a rusty spoon..
> > 
> > Yes.  Defining dependency chains in the way the standard currently seems 
> > to do must come from people not writing compilers.  There's simply no 
> > sensible way to implement it without being really conservative, because 
> > the depchains can contain arbitrary constructs including stores, 
> > loads and function calls but must still be observed.  
> > 
> > And with conservative I mean "everything is a source of a dependency, and 
> > hence can't be removed, reordered or otherwise fiddled with", and that 
> > includes code sequences where no atomic objects are anywhere in sight [1].
> > In the light of that the only realistic way (meaning to not have to 
> > disable optimization everywhere) to implement consume as currently 
> > specified is to map it to acquire.  At which point it becomes pointless.
> 
> No, only memory_order_consume loads and [[carries_dependency]]
> function arguments are sources of dependency chains.

However, that is, given how the standard specifies things, just one of
the possible ways for how an implementation can handle this.  Treating
[[carries_dependency]] as a "necessary" annotation to make exploiting
mo_consume work in practice is possible, but it's not required by the
standard.

Also, dependencies are specified to flow through loads and stores
(restricted to scalar objects and bitfields), so any load that might
load from a dependency-carrying store can also be a source (and that
doesn't seem to be restricted by [[carries_dependency]]).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 standard uses into something that is clearer and correct.
 
 The chain of pointers thing is still obviously a data dependency,
 but by limiting it to pointers, it simplifies the language, clarifies
 the meaning, avoids all syntactic tricks (ie p-p is clearly a
 syntactic dependency on p, but does *not* involve in any way
 following the pointer) and makes it basically impossible for the
 compiler to break the dependency without doing value prediction, and
 since value prediction has to be disallowed anyway, that's a feature,
 not a bug.

AFAIU, Michael is wondering about how we can separate non-synchronizing
code (ie, in this case, not taking part in any chain of pointers used
with mo_consume loads) from code that does.  If we cannot, then we
prevent value prediction *everywhere*, unless the compiler can prove
that the code is never part of such a chain (which is hard due to alias
analysis being hard, etc.).
(We can probably argue to which extent value prediction is necessary for
generation of efficient code, but it obviously does work in
non-synchronizing code (or even with acquire barriers with some care) --
so forbidding it entirely might be bad.)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote:
 On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
  Hi,
  
  On Mon, 24 Feb 2014, Linus Torvalds wrote:
  
To me that reads like
   
  int i;
  int *q = &i;
  int **p = &q;
   
  atomic_XXX (&p, CONSUME);
   
orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
want to say that it orders against aliased storage - but then go further
and include indirectly through a chain of pointers?!  Thus an
atomic read of a int * orders against any 'int' memory operation but
not against 'float' memory operations?
   
   No, it's not about type at all, and the chain of pointers can be
   much more complex than that, since the int * can point to within an
   object that contains other things than just that int (the int can
   be part of a structure that then has pointers to other structures
   etc).
  
  So, let me try to poke holes into your definition or increase my 
  understanding :) .  You said chain of pointers(dereferences I assume), 
  e.g. if p is result of consume load, then access to 
  p->here->there->next->prev->stuff is supposed to be ordered with that load 
  (or only when that last load/store itself is also an atomic load or 
  store?).
  
  So, what happens if the pointer deref chain is partly hidden in some 
  functions:
  
  A * adjustptr (B *ptr) { return ptr->here->there->next; }
  B * p = atomic_XXX (&somewhere, consume);
  adjustptr(p)->prev->stuff = bla;
  
  As far as I understood you, this whole ptrderef chain business would be 
  only an optimization opportunity, right?  So if the compiler can't be sure 
  how p is actually used (as in my function-using case, assume adjustptr is 
  defined in another unit), then the consume load would simply be 
  transformed into an acquire (or whatever, with some barrier I mean)?  Only 
  _if_ the compiler sees all obvious uses of p (indirectly through pointer 
  derefs) can it, yeah, do what with the consume load?
 
 Good point, I left that out of my list.  Adding it:
 
 13.   By default, pointer chains do not propagate into or out of functions.
   In implementations having attributes, a [[carries_dependency]]
   may be used to mark a function argument or return as passing
   a pointer chain into or out of that function.
 
   If a function does not contain memory_order_consume loads and
   also does not contain [[carries_dependency]] attributes, then
   that function may be compiled using any desired dependency-breaking
   optimizations.
 
   The ordering effects are implementation defined when a given
   pointer chain passes into or out of a function through a parameter
   or return not marked with a [[carries_dependency]] attribute.
 
 Note that this last paragraph differs from the current standard, which
 would require ordering regardless.

I would prefer if we could get rid off [[carries_dependency]] as well;
currently, it's a hint whose effectiveness really depends on how the
particular implementation handles this attribute.  If we still need
something like it in the future, it would be good if it had a clearer
use and performance effects.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Paul E. McKenney
On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote:
 
 On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
  On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
   Hi,
   
   On Thu, 20 Feb 2014, Linus Torvalds wrote:
   
But I'm pretty sure that any compiler guy must *hate* that current odd
dependency-generation part, and if I was a gcc person, seeing that
bugzilla entry Torvald pointed at, I would personally want to
dismember somebody with a rusty spoon..
   
   Yes.  Defining dependency chains in the way the standard currently seems 
   to do must come from people not writing compilers.  There's simply no 
   sensible way to implement it without being really conservative, because 
   the depchains can contain arbitrary constructs including stores, 
   loads and function calls but must still be observed.  
   
   And with conservative I mean everything is a source of a dependency, and 
   hence can't be removed, reordered or otherwise fiddled with, and that 
   includes code sequences where no atomic objects are anywhere in sight [1].
   In the light of that the only realistic way (meaning to not have to 
   disable optimization everywhere) to implement consume as currently 
   specified is to map it to acquire.  At which point it becomes pointless.
  
  No, only memory_order_consume loads and [[carries_dependency]]
  function arguments are sources of dependency chains.
 
 However, that is, given how the standard specifies things, just one of
 the possible ways for how an implementation can handle this.  Treating
 [[carries_dependency]] as a necessary annotation to make exploiting
 mo_consume work in practice is possible, but it's not required by the
 standard.
 
 Also, dependencies are specified to flow through loads and stores
 (restricted to scalar objects and bitfields), so any load that might
 load from a dependency-carrying store can also be a source (and that
 doesn't seem to be restricted by [[carries_dependency]]).

OK, this last is clearly totally unacceptable.  :-/

Leaving aside the option of dropping the whole thing for the moment,
the only thing that suggests itself is having all dependencies die at
a specific point in the code, corresponding to the rcu_read_unlock().
But as far as I can see, that absolutely requires necessary parameter
and return marking in order to correctly handle nested RCU read-side
critical sections in different functions.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Joseph S. Myers
On Wed, 26 Feb 2014, Torvald Riegel wrote:

 On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
  On Fri, 21 Feb 2014, Paul E. McKenney wrote:
  
   This needs to be as follows:
   
   [[carries_dependency]] int getzero(int i [[carries_dependency]])
   {
 return i - i;
   }
   
   Otherwise dependencies won't get carried through it.
  
  C11 doesn't have attributes at all (and no specification regarding calls 
  and dependencies that I can see).  And the way I read the C++11 
  specification of carries_dependency is that specifying carries_dependency 
  is purely about increasing optimization of the caller: that if it isn't 
  specified, then the caller doesn't know what dependencies might be 
  carried.  Note: The carries_dependency attribute does not change the 
  meaning of the program, but may result in generation of more efficient 
  code. - end note.
 
 I think that this last sentence can be kind of misleading, especially
 when looking at it from an implementation point of view.  How
 dependencies are handled (ie, preserving the syntactic dependencies vs.
 emitting barriers) must be part of the ABI, or things like
 [[carries_dependency]] won't work as expected (or lead to inefficient
 code).  Thus, in practice, all compiler vendors on a platform would have
 to agree to a particular handling, which might end up in selecting the
 easy-but-conservative implementation option (ie, always emitting
 mo_acquire when the source uses mo_consume).

Regardless of the ABI, my point is that if a program is valid, it is also 
valid when all uses of [[carries_dependency]] are removed.  If a function 
doesn't use [[carries_dependency]], that means dependencies may or may 
not be carried through this function.  If a function uses 
[[carries_dependency]], that means that certain dependencies *are* carried 
through the function (and the ABI should then specify what this means the 
caller can rely on, in terms of the architecture's memory model).  (This 
may or may not be useful, but it's how I understand C++11.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Torvald Riegel
On Wed, 2014-02-26 at 18:43 +, Joseph S. Myers wrote:
 On Wed, 26 Feb 2014, Torvald Riegel wrote:
 
  On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote:
   On Fri, 21 Feb 2014, Paul E. McKenney wrote:
   
This needs to be as follows:

[[carries_dependency]] int getzero(int i [[carries_dependency]])
{
return i - i;
}

Otherwise dependencies won't get carried through it.
   
   C11 doesn't have attributes at all (and no specification regarding calls 
   and dependencies that I can see).  And the way I read the C++11 
   specification of carries_dependency is that specifying carries_dependency 
   is purely about increasing optimization of the caller: that if it isn't 
   specified, then the caller doesn't know what dependencies might be 
   carried.  Note: The carries_dependency attribute does not change the 
   meaning of the program, but may result in generation of more efficient 
   code. - end note.
  
  I think that this last sentence can be kind of misleading, especially
  when looking at it from an implementation point of view.  How
  dependencies are handled (ie, preserving the syntactic dependencies vs.
  emitting barriers) must be part of the ABI, or things like
  [[carries_dependency]] won't work as expected (or lead to inefficient
  code).  Thus, in practice, all compiler vendors on a platform would have
  to agree to a particular handling, which might end up in selecting the
  easy-but-conservative implementation option (ie, always emitting
  mo_acquire when the source uses mo_consume).
 
 Regardless of the ABI, my point is that if a program is valid, it is also 
 valid when all uses of [[carries_dependency]] are removed.  If a function 
 doesn't use [[carries_dependency]], that means dependencies may or may 
 not be carried through this function.  If a function uses 
 [[carries_dependency]], that means that certain dependencies *are* carried 
 through the function (and the ABI should then specify what this means the 
 caller can rely on, in terms of the architecture's memory model).  (This 
 may or may not be useful, but it's how I understand C++11.)

I agree.  What I tried to point out is that it's not the case that an
*implementation* can just ignore [[carries_dependency]].  So from an
implementation perspective, the attribute does have semantics.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
> On 02/25/14 17:15, Paul E. McKenney wrote:
> >>I have for the last several years been 100% convinced that the Intel
> >>memory ordering is the right thing, and that people who like weak
> >>memory ordering are wrong and should try to avoid reproducing if at
> >>all possible. But given that we have memory orderings like power and
> >>ARM, I don't actually see a sane way to get a good strong ordering.
> >>You can teach compilers about cases like the above when they actually
> >>see all the code and they could poison the value chain etc. But it
> >>would be fairly painful, and once you cross object files (or even just
> >>functions in the same compilation unit, for that matter), it goes from
> >>painful to just "ridiculously not worth it".
> >
> >And I have indeed seen a post or two from you favoring stronger memory
> >ordering over the past few years.  ;-)
> I couldn't agree more.
> 
> >
> >Are ARM and Power really the bad boys here?  Or are they instead playing
> >the role of the canary in the coal mine?
> That's a question I've been struggling with recently as well.  I
> suspect they (arm, power) are going to be the outliers rather than
> the canary. While the weaker model may give them some advantages WRT
> scalability, I don't think it'll ultimately be enough to overcome
> the difficulty in writing correct low level code for them.
> 
> Regardless, they're here and we have to deal with them.

Agreed...

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
>  wrote:
> >  wrote:
> >> I have for the last several years been 100% convinced that the Intel
> >> memory ordering is the right thing, and that people who like weak
> >> memory ordering are wrong and should try to avoid reproducing if at
> >> all possible.
> >
> > Are ARM and Power really the bad boys here?  Or are they instead playing
> > the role of the canary in the coal mine?
> 
> To paraphrase some older threads, I think Linus's argument is that
> weak memory ordering is like branch delay slots: a way to make a simple
> implementation simpler, but ends up being no help to a more aggressive
> implementation.
> 
> Branch delay slots give a one-cycle bonus to in-order cores, but
> once you go superscalar and add branch prediction, they stop helping,
> and once you go full out of order, they're just an annoyance.
> 
> Likewise, I can see the point that weak ordering can help make a simple
> cache interface simpler, but once you start doing speculative loads,
> you've already bought and paid for all the hardware you need to do
> stronger coherency.
> 
> Another thing that requires all the strong-coherency machinery is
> a high-performance implementation of the various memory barrier and
> synchronization operations.  Yes, a low-performance (drain the pipeline)
> implementation is tolerable if the instructions aren't used frequently,
> but once you're really trying, it doesn't save complexity.
> 
> Once you're there, strong coherency always doesn't actually cost you any
> time outside of critical synchronization code, and it both simplifies
> and speeds up the tricky synchronization software.
> 
> 
> So PPC and ARM's weak ordering are not the direction the future is going.
> Rather, weak ordering is something that's only useful in a limited
> technology window, which is rapidly passing.

That does indeed appear to be Intel's story.  Might well be correct.
Time will tell.

> If you can find someone in IBM who's worked on the Z series cache
> coherency (extremely strong ordering), they probably have some useful
> insights.  The big question is if strong ordering, once you've accepted
> the implementation complexity and area, actually costs anything in
> execution time.  If there's an unavoidable cost which weak ordering saves,
> that's significant.

There has been a lot of ink spilled on this argument.  ;-)

PPC has much larger CPU counts than does the mainframe.  On the other
hand, there are large x86 systems.  Some claim that there are differences
in latency due to the different approaches, and there could be a long
argument about whether all this is inherent in the memory ordering or
whether it is due to implementation issues.

I don't claim to know the answer.  I do know that ARM and PPC are
here now, and that I need to deal with them.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
>  wrote:
> >
> > So let me see if I understand your reasoning.  My best guess is that it
> > goes something like this:
> >
> > 1.  The Linux kernel contains code that passes pointers from
> > rcu_dereference() through external functions.
> 
> No, actually, it's not so much Linux-specific at all.
> 
> I'm actually thinking about what I'd do as a compiler writer, and as a
> defender of the "C is a high-level assembler" concept.
> 
> I love C. I'm a huge fan. I think it's a great language, and I think
> it's a great language not because of some theoretical issues, but
> because it is the only language around that actually maps fairly well
> to what machines really do.
> 
> And it's a *simple* language. Sure, it's not quite as simple as it
> used to be, but look at how thin the "K&R book" is. Which pretty much
> describes it - still.
> 
> That's the real strength of C, and why it's the only language serious
> people use for system programming.  Ignore C++ for a while (Jesus
> Xavier Christ, I've had to do C++ programming for subsurface), and
> just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990.  It was a lot smaller
then.

> I can look at C code, and I can understand what the code generation
> is, and what it will really *do*. And I think that's important.
> Abstractions that hide what the compiler will actually generate are
> bad abstractions.
> 
> And ok, so this is obviously Linux-specific in that it's generally
> only Linux where I really care about the code generation, but I do
> think it's a bigger issue too.
> 
> So I want C features to *map* to the hardware features they implement.
> The abstractions should match each other, not fight each other.

OK...

> > Actually, the fact that there are more potential optimizations than I can
> > think of is a big reason for my insistence on the carries-a-dependency
> > crap.  My lack of optimization omniscience makes me very nervous about
> > relying on there never ever being a reasonable way of computing a given
> > result without preserving the ordering.
> 
> But if I can give two clear examples that are basically identical from
> a syntactic standpoint, and one clearly can be trivially optimized to
> the point where the ordering guarantee goes away, and the other
> cannot, and you cannot describe the difference, then I think your
> description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the
ordering guarantee in either case.  Yes, I did notice that you find
that unacceptable.

> And I do *not* think the C language should be defined by how it can be
> described. Leave that to things like Haskell or LISP, where the goal
> is some kind of completeness of the language that is about the
> language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking
in.  I don't know how to describe what the optimizers are and are not
permitted to do strictly in terms of the underlying hardware.

> >> So the code sequence I already mentioned is *not* ordered:
> >>
> >> Litmus test 1:
> >>
> >> p = atomic_read(pp, consume);
> >> if (p == )
> >> return p->val;
> >>
> >>is *NOT* ordered, because the compiler can trivially turn this into
> >> "return variable.val", and break the data dependency.
> >
> > Right, given your model, the compiler is free to produce code that
> > doesn't order the load from pp against the load from p->val.
> 
> Yes. Note also that that is what existing compilers would actually do.
> 
> And they'd do it "by mistake": they'd load the address of the variable
> into a register, and then compare the two registers, and then end up
> using _one_ of the registers as the base pointer for the "p->val"
> access, but I can almost *guarantee* that there are going to be
> sequences where some compiler will choose one register over the other
> based on some random detail.
> 
> So my model isn't just a "model", it also happens to describe reality.

Sounds to me like your model -is- reality.  I believe that it is useful
to constrain reality from time to time, but understand that you vehemently
disagree.

> > Indeed, it won't work across different compilation units unless
> > the compiler is told about it, which is of course the whole point of
> > [[carries_dependency]].  Understood, though, the Linux kernel currently
> > does not have anything that could reasonably automatically generate those
> > [[carries_dependency]] attributes.  (Or are there other reasons why you
> > believe [[carries_dependency]] is problematic?)
> 
> So I think carries_dependency is problematic because:
> 
>  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

>  - it requires the programmer to solve the problem of 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Jeff Law

On 02/25/14 17:15, Paul E. McKenney wrote:

I have for the last several years been 100% convinced that the Intel
memory ordering is the right thing, and that people who like weak
memory ordering are wrong and should try to avoid reproducing if at
all possible. But given that we have memory orderings like power and
ARM, I don't actually see a sane way to get a good strong ordering.
You can teach compilers about cases like the above when they actually
see all the code and they could poison the value chain etc. But it
would be fairly painful, and once you cross object files (or even just
functions in the same compilation unit, for that matter), it goes from
painful to just "ridiculously not worth it".


And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years.  ;-)

I couldn't agree more.



Are ARM and Power really the bad boys here?  Or are they instead playing
the role of the canary in the coal mine?
That's a question I've been struggling with recently as well.  I suspect 
they (arm, power) are going to be the outliers rather than the canary. 
While the weaker model may give them some advantages WRT scalability, I 
don't think it'll ultimately be enough to overcome the difficulty in 
writing correct low level code for them.


Regardless, they're here and we have to deal with them.


Jeff


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread George Spelvin
 wrote:
>  wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible.
>
> Are ARM and Power really the bad boys here?  Or are they instead playing
> the role of the canary in the coal mine?

To paraphrase some older threads, I think Linus's argument is that
weak memory ordering is like branch delay slots: a way to make a simple
implementation simpler, but ends up being no help to a more aggressive
implementation.

Branch delay slots give a one-cycle bonus to in-order cores, but
once you go superscalar and add branch prediction, they stop helping,
and once you go full out of order, they're just an annoyance.

Likewise, I can see the point that weak ordering can help make a simple
cache interface simpler, but once you start doing speculative loads,
you've already bought and paid for all the hardware you need to do
stronger coherency.

Another thing that requires all the strong-coherency machinery is
a high-performance implementation of the various memory barrier and
synchronization operations.  Yes, a low-performance (drain the pipeline)
implementation is tolerable if the instructions aren't used frequently,
but once you're really trying, it doesn't save complexity.

Once you're there, strong coherency always doesn't actually cost you any
time outside of critical synchronization code, and it both simplifies
and speeds up the tricky synchronization software.


So PPC and ARM's weak ordering are not the direction the future is going.
Rather, weak ordering is something that's only useful in a limited
technology window, which is rapidly passing.

If you can find someone in IBM who's worked on the Z series cache
coherency (extremely strong ordering), they probably have some useful
insights.  The big question is if strong ordering, once you've accepted
the implementation complexity and area, actually costs anything in
execution time.  If there's an unavoidable cost which weak ordering saves,
that's significant.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
 wrote:
>
> So let me see if I understand your reasoning.  My best guess is that it
> goes something like this:
>
> 1.  The Linux kernel contains code that passes pointers from
> rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a
defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think
it's a great language not because of some theoretical issues, but
because it is the only language around that actually maps fairly well
to what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it
used to be, but look at how thin the "K&R book" is. Which pretty much
describes it - still.

That's the real strength of C, and why it's the only language serious
people use for system programming.  Ignore C++ for a while (Jesus
Xavier Christ, I've had to do C++ programming for subsurface), and
just think about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation
is, and what it will really *do*. And I think that's important.
Abstractions that hide what the compiler will actually generate are
bad abstractions.

And ok, so this is obviously Linux-specific in that it's generally
only Linux where I really care about the code generation, but I do
think it's a bigger issue too.

So I want C features to *map* to the hardware features they implement.
The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I can
> think of is a big reason for my insistence on the carries-a-dependency
> crap.  My lack of optimization omniscience makes me very nervous about
> relying on there never ever being a reasonable way of computing a given
> result without preserving the ordering.

But if I can give two clear examples that are basically identical from
a syntactic standpoint, and one clearly can be trivially optimized to
the point where the ordering guarantee goes away, and the other
cannot, and you cannot describe the difference, then I think your
description is seriously lacking.

And I do *not* think the C language should be defined by how it can be
described. Leave that to things like Haskell or LISP, where the goal
is some kind of completeness of the language that is about the
language, not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>> p = atomic_read(pp, consume);
>> if (p == &variable)
>> return p->val;
>>
>>is *NOT* ordered, because the compiler can trivially turn this into
>> "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that
> doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable
into a register, and then compare the two registers, and then end up
using _one_ of the registers as the base pointer for the "p->val"
access, but I can almost *guarantee* that there are going to be
sequences where some compiler will choose one register over the other
based on some random detail.

So my model isn't just a "model", it also happens to describe reality.

> Indeed, it won't work across different compilation units unless
> the compiler is told about it, which is of course the whole point of
> [[carries_dependency]].  Understood, though, the Linux kernel currently
> does not have anything that could reasonably automatically generate those
> [[carries_dependency]] attributes.  (Or are there other reasons why you
> believe [[carries_dependency]] is problematic?)

So I think carries_dependency is problematic because:

 - it's not actually in C11 afaik

 - it requires the programmer to solve the problem of the standard not
matching the hardware.

 - I think it's just insanely ugly, *especially* if it's actually
meant to work so that the current carries-a-dependency works even for
insane expressions like "a-a".

In practice, it's one of those things where I guess nobody actually
would ever use it.

> Of course, I cannot resist putting forward a third litmus test:
>
> static struct foo variable1;
> static struct foo variable2;
> static struct foo *pp = &variable1;
>
> T1: initialize_foo(&variable2);
> atomic_store_explicit(&pp, &variable2, memory_order_release);
> /* The above is the only store to pp in this translation unit,
>  * and the address of pp is not exported in any way.
>  */
>
> T2: if (p == &variable1)
> return p->val1; /* Must be variable1.val1. */
> else
> return p->val2; /* Must be variable2.val2. */
>
> My guess is that your approach would not provide ordering in this
> case, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds
>  wrote:
> >
> > Litmus test 1:
> >
> > p = atomic_read(pp, consume);
> > if (p == )
> > return p->val;
> >
> >is *NOT* ordered
> 
> Btw, don't get me wrong. I don't _like_ it not being ordered, and I
> actually did spend some time thinking about my earlier proposal on
> strengthening the 'consume' ordering.

Understood.

> I have for the last several years been 100% convinced that the Intel
> memory ordering is the right thing, and that people who like weak
> memory ordering are wrong and should try to avoid reproducing if at
> all possible. But given that we have memory orderings like power and
> ARM, I don't actually see a sane way to get a good strong ordering.
> You can teach compilers about cases like the above when they actually
> see all the code and they could poison the value chain etc. But it
> would be fairly painful, and once you cross object files (or even just
> functions in the same compilation unit, for that matter), it goes from
> painful to just "ridiculously not worth it".

And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years.  ;-)

> So I think the C semantics should mirror what the hardware gives us -
> and do so even in the face of reasonable optimizations - not try to do
> something else that requires compilers to treat "consume" very
> differently.

I am sure that a great many people would jump for joy at the chance to
drop any and all RCU-related verbiage from the C11 and C++11 standards.
(I know, you aren't necessarily advocating this, but given what you
say above, I cannot think what verbiage would remain.)

The thing that makes me very nervous is how much the definition of
"reasonable optimization" has changed.  For example, before the
2.6.10 Linux kernel, we didn't even apply volatile semantics to
fetches of RCU-protected pointers -- and as far as I know, never
needed to.  But since then, there have been several cases where the
compiler happily hoisted a normal load out of a surprisingly large loop.
Hardware advances can come into play as well.  For example, my very first
RCU work back in the early 90s was on a parallel system whose CPUs had
no branch-prediction hardware (80386 or 80486, I don't remember which).
Now people talk about compilers using branch prediction hardware to
implement value-speculation optimizations.  Five or ten years from now,
who knows what crazy optimizations might be considered to be completely
reasonable?

Are ARM and Power really the bad boys here?  Or are they instead playing
the role of the canary in the coal mine?

> If people made me king of the world, I'd outlaw weak memory ordering.
> You can re-order as much as you want in hardware with speculation etc,
> but you should always *check* your speculation and make it *look* like
> you did everything in order. Which is pretty much the intel memory
> ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when
first presenting RCU to the Digital UNIX folks long ago, I do have some
sympathy with this line of thought.  But as you say, it is not the world
we currently live in.

Of course, in the final analysis, your kernel, your call.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread George Spelvin
paul...@linux.vnet.ibm.com wrote:
 torva...@linux-foundation.org wrote:
 I have for the last several years been 100% convinced that the Intel
 memory ordering is the right thing, and that people who like weak
 memory ordering are wrong and should try to avoid reproducing if at
 all possible.

 Are ARM and Power really the bad boys here?  Or are they instead playing
 the role of the canary in the coal mine?

To paraphrase some older threads, I think Linus's argument is that
weak memory ordering is like branch delay slots: a way to make a simple
implementation simpler, but ends up being no help to a more aggressive
implementation.

Branch delay slots give a one-cycle bonus to in-order cores, but
once you go superscalar and add branch prediction, they stop helping,
and once you go full out of order, they're just an annoyance.

Likewise, I can see the point that weak ordering can help make a simple
cache interface simpler, but once you start doing speculative loads,
you've already bought and paid for all the hardware you need to do
stronger coherency.

Another thing that requires all the strong-coherency machinery is
a high-performance implementation of the various memory barrier and
synchronization operations.  Yes, a low-performance (drain the pipeline)
implementation is tolerable if the instructions aren't used frequently,
but once you're really trying, it doesn't save complexity.

Once you're there, strong coherency never actually costs you any
time outside of critical synchronization code, and it both simplifies
and speeds up the tricky synchronization software.


So PPC and ARM's weak ordering is not the direction the future is going.
Rather, weak ordering is something that's only useful in a limited
technology window, which is rapidly passing.

If you can find someone in IBM who's worked on the Z series cache
coherency (extremely strong ordering), they probably have some useful
insights.  The big question is if strong ordering, once you've accepted
the implementation complexity and area, actually costs anything in
execution time.  If there's an unavoidable cost which weak ordering saves,
that's significant.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Jeff Law

On 02/25/14 17:15, Paul E. McKenney wrote:

I have for the last several years been 100% convinced that the Intel
memory ordering is the right thing, and that people who like weak
memory ordering are wrong and should try to avoid reproducing if at
all possible. But given that we have memory orderings like power and
ARM, I don't actually see a sane way to get a good strong ordering.
You can teach compilers about cases like the above when they actually
see all the code and they could poison the value chain etc. But it
would be fairly painful, and once you cross object files (or even just
functions in the same compilation unit, for that matter), it goes from
painful to just ridiculously not worth it.


And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years.  ;-)

I couldn't agree more.



Are ARM and Power really the bad boys here?  Or are they instead playing
the role of the canary in the coal mine?
That's a question I've been struggling with recently as well.  I suspect 
they (arm, power) are going to be the outliers rather than the canary. 
While the weaker model may give them some advantages WRT scalability, I 
don't think it'll ultimately be enough to overcome the difficulty in 
writing correct low level code for them.


Regardless, they're here and we have to deal with them.


Jeff


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
 On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  So let me see if I understand your reasoning.  My best guess is that it
  goes something like this:
 
  1.  The Linux kernel contains code that passes pointers from
  rcu_dereference() through external functions.
 
 No, actually, it's not so much Linux-specific at all.
 
 I'm actually thinking about what I'd do as a compiler writer, and as a
 defender of the "C is a high-level assembler" concept.
 
 I love C. I'm a huge fan. I think it's a great language, and I think
 it's a great language not because of some theoretical issues, but
 because it is the only language around that actually maps fairly well
 to what machines really do.
 
 And it's a *simple* language. Sure, it's not quite as simple as it
 used to be, but look at how thin the K&R book is. Which pretty much
 describes it - still.
 
 That's the real strength of C, and why it's the only language serious
 people use for system programming.  Ignore C++ for a while (Jesus
 Xavier Christ, I've had to do C++ programming for subsurface), and
 just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990.  It was a lot smaller
then.

 I can look at C code, and I can understand what the code generation
 is, and what it will really *do*. And I think that's important.
 Abstractions that hide what the compiler will actually generate are
 bad abstractions.
 
 And ok, so this is obviously Linux-specific in that it's generally
 only Linux where I really care about the code generation, but I do
 think it's a bigger issue too.
 
 So I want C features to *map* to the hardware features they implement.
 The abstractions should match each other, not fight each other.

OK...

  Actually, the fact that there are more potential optimizations than I can
  think of is a big reason for my insistence on the carries-a-dependency
  crap.  My lack of optimization omniscience makes me very nervous about
  relying on there never ever being a reasonable way of computing a given
  result without preserving the ordering.
 
 But if I can give two clear examples that are basically identical from
 a syntactic standpoint, and one clearly can be trivially optimized to
 the point where the ordering guarantee goes away, and the other
 cannot, and you cannot describe the difference, then I think your
 description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the
ordering guarantee in either case.  Yes, I did notice that you find
that unacceptable.

 And I do *not* think the C language should be defined by how it can be
 described. Leave that to things like Haskell or LISP, where the goal
 is some kind of completeness of the language that is about the
 language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking
in.  I don't know how to describe what the optimizers are and are not
permitted to do strictly in terms of the underlying hardware.

  So the code sequence I already mentioned is *not* ordered:
 
  Litmus test 1:
 
  p = atomic_read(pp, consume);
  if (p == &variable)
  return p->val;
 
 is *NOT* ordered, because the compiler can trivially turn this into
  "return variable.val", and break the data dependency.
 
  Right, given your model, the compiler is free to produce code that
  doesn't order the load from pp against the load from p->val.
 
 Yes. Note also that that is what existing compilers would actually do.
 
 And they'd do it "by mistake": they'd load the address of the variable
 into a register, and then compare the two registers, and then end up
 using _one_ of the registers as the base pointer for the "p->val"
 access, but I can almost *guarantee* that there are going to be
 sequences where some compiler will choose one register over the other
 based on some random detail.
 
 So my model isn't just a "model", it also happens to describe reality.

Sounds to me like your model -is- reality.  I believe that it is useful
to constrain reality from time to time, but understand that you vehemently
disagree.

  Indeed, it won't work across different compilation units unless
  the compiler is told about it, which is of course the whole point of
  [[carries_dependency]].  Understood, though, the Linux kernel currently
  does not have anything that could reasonably automatically generate those
  [[carries_dependency]] attributes.  (Or are there other reasons why you
  believe [[carries_dependency]] is problematic?)
 
 So I think carries_dependency is problematic because:
 
  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

  - it requires the programmer to solve the problem of the standard not
 matching the hardware.

The programmer in this instance being the compiler writer?

  - I 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
 paul...@linux.vnet.ibm.com wrote:
  torva...@linux-foundation.org wrote:
  I have for the last several years been 100% convinced that the Intel
  memory ordering is the right thing, and that people who like weak
  memory ordering are wrong and should try to avoid reproducing if at
  all possible.
 
  Are ARM and Power really the bad boys here?  Or are they instead playing
  the role of the canary in the coal mine?
 
 To paraphrase some older threads, I think Linus's argument is that
 weak memory ordering is like branch delay slots: a way to make a simple
 implementation simpler, but ends up being no help to a more aggressive
 implementation.
 
 Branch delay slots give a one-cycle bonus to in-order cores, but
 once you go superscalar and add branch prediction, they stop helping,
 and once you go full out of order, they're just an annoyance.
 
 Likewise, I can see the point that weak ordering can help make a simple
 cache interface simpler, but once you start doing speculative loads,
 you've already bought and paid for all the hardware you need to do
 stronger coherency.
 
 Another thing that requires all the strong-coherency machinery is
 a high-performance implementation of the various memory barrier and
 synchronization operations.  Yes, a low-performance (drain the pipeline)
 implementation is tolerable if the instructions aren't used frequently,
 but once you're really trying, it doesn't save complexity.
 
 Once you're there, strong coherency never actually costs you any
 time outside of critical synchronization code, and it both simplifies
 and speeds up the tricky synchronization software.
 
 
 So PPC and ARM's weak ordering is not the direction the future is going.
 Rather, weak ordering is something that's only useful in a limited
 technology window, which is rapidly passing.

That does indeed appear to be Intel's story.  Might well be correct.
Time will tell.

 If you can find someone in IBM who's worked on the Z series cache
 coherency (extremely strong ordering), they probably have some useful
 insights.  The big question is if strong ordering, once you've accepted
 the implementation complexity and area, actually costs anything in
 execution time.  If there's an unavoidable cost which weak ordering saves,
 that's significant.

There has been a lot of ink spilled on this argument.  ;-)

PPC has much larger CPU counts than does the mainframe.  On the other
hand, there are large x86 systems.  Some claim that there are differences
in latency due to the different approaches, and there could be a long
argument about whether all this is inherent in the memory ordering or
whether it is due to implementation issues.

I don't claim to know the answer.  I do know that ARM and PPC are
here now, and that I need to deal with them.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
 On 02/25/14 17:15, Paul E. McKenney wrote:
 I have for the last several years been 100% convinced that the Intel
 memory ordering is the right thing, and that people who like weak
 memory ordering are wrong and should try to avoid reproducing if at
 all possible. But given that we have memory orderings like power and
 ARM, I don't actually see a sane way to get a good strong ordering.
 You can teach compilers about cases like the above when they actually
 see all the code and they could poison the value chain etc. But it
 would be fairly painful, and once you cross object files (or even just
 functions in the same compilation unit, for that matter), it goes from
 painful to just ridiculously not worth it.
 
 And I have indeed seen a post or two from you favoring stronger memory
 ordering over the past few years.  ;-)
 I couldn't agree more.
 
 
 Are ARM and Power really the bad boys here?  Or are they instead playing
 the role of the canary in the coal mine?
 That's a question I've been struggling with recently as well.  I
 suspect they (arm, power) are going to be the outliers rather than
 the canary. While the weaker model may give them some advantages WRT
 scalability, I don't think it'll ultimately be enough to overcome
 the difficulty in writing correct low level code for them.
 
 Regardless, they're here and we have to deal with them.

Agreed...

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds
 wrote:
>
> Litmus test 1:
>
> p = atomic_read(pp, consume);
> if (p == &variable)
> return p->val;
>
>is *NOT* ordered

Btw, don't get me wrong. I don't _like_ it not being ordered, and I
actually did spend some time thinking about my earlier proposal on
strengthening the 'consume' ordering.

I have for the last several years been 100% convinced that the Intel
memory ordering is the right thing, and that people who like weak
memory ordering are wrong and should try to avoid reproducing if at
all possible. But given that we have memory orderings like power and
ARM, I don't actually see a sane way to get a good strong ordering.
You can teach compilers about cases like the above when they actually
see all the code and they could poison the value chain etc. But it
would be fairly painful, and once you cross object files (or even just
functions in the same compilation unit, for that matter), it goes from
painful to just "ridiculously not worth it".

So I think the C semantics should mirror what the hardware gives us -
and do so even in the face of reasonable optimizations - not try to do
something else that requires compilers to treat "consume" very
differently.

If people made me king of the world, I'd outlaw weak memory ordering.
You can re-order as much as you want in hardware with speculation etc,
but you should always *check* your speculation and make it *look* like
you did everything in order. Which is pretty much the intel memory
ordering (ignoring the write buffering).

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 03:35:04PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney
>  wrote:
> >>
> >> What if the "nothing modifies 'p'" part looks like this:
> >>
> >> if (p != &myvariable)
> >> return;
> >>
> >> and now any sane compiler will happily optimize "q = *p" into "q =
> >> myvariable", and we're all done - nothing invalid was ever
> >
> > Yes, the compiler could do that.  But it would still be required to
> > carry a dependency from the memory_order_consume read to the "*p",
> 
> But that's *BS*. You didn't actually listen to the main issue.
> 
> Paul, why do you insist on this carries-a-dependency crap?

Sigh.  Read on...

> It's broken. If you don't believe me, then believe the compiler person
> who already piped up and told you so.
> 
> The "carries a dependency" model is broken. Get over it.
> 
> No sane compiler will ever distinguish two different registers that
> have the same value from each other. No sane compiler will ever say
> "ok, register r1 has the exact same value as register r2, but r2
> carries the dependency, so I need to make sure to pass r2 to that
> function or use it as a base pointer".
> 
> And nobody sane should *expect* a compiler to distinguish two
> registers with the same value that way.
> 
> So the whole model is broken.
> 
> I gave an alternate model (the "restrict"), and you didn't seem to
> understand the really fundamental difference. It's not a language
> difference, it's a conceptual difference.
> 
> In the broken "carries a dependency" model, you have fight all those
> aliases that can have the same value, and it is not a fight you can
> win. We've had the "p-p" examples, we've had the "p&0" examples, but
> the fact is, that "p==&myvariable" example IS EXACTLY THE SAME THING.
> 
> All three of those things: "p-p", "p&0", and "p==&myvariable" mean
> that any compiler worth its salt now knows that "p" carries no
> information, and will optimize it away.
> 
> So please stop arguing against that. Whenever you argue against that
> simple fact, you are arguing against sane compilers.

So let me see if I understand your reasoning.  My best guess is that it
goes something like this:

1.  The Linux kernel contains code that passes pointers from
rcu_dereference() through external functions.

2.  Code in the Linux kernel expects the normal RCU ordering
guarantees to be in effect even when external functions are
involved.

3.  When compiling one of these external functions, the C compiler
has no way of knowing about these RCU ordering guarantees.

4.  The C compiler might therefore apply any and all optimizations
to these external functions.

5.  This in turn implies that the only way to prohibit any given
optimization from being applied to the results obtained from
rcu_dereference() is to prohibit that optimization globally.

6.  We have to be very careful what optimizations are globally
prohibited, because a poor choice could result in unacceptable
performance degradation.

7.  Therefore, the only operations that can be counted on to
maintain the needed RCU orderings are those where the compiler
really doesn't have any choice, in other words, where any
reasonable way of computing the result will necessarily maintain
the needed ordering.

Did I get this right, or am I confused?

> So *accept* the fact that some operations (and I guarantee that there
> are more of those than you can think of, and you can create them with
> various tricks using pretty much *any* feature in the C language)
> essentially take the data information away. And just accept the fact
> that then the ordering goes away too.

Actually, the fact that there are more potential optimizations than I can
think of is a big reason for my insistence on the carries-a-dependency
crap.  My lack of optimization omniscience makes me very nervous about
relying on there never ever being a reasonable way of computing a given
result without preserving the ordering.

> So give up on "carries a dependency". Because there will be cases
> where that dependency *isn't* carried.
> 
> The language of the standard needs to get *away* from the broken
> model, because otherwise the standard is broken.
> 
> I suggest we instead talk about "litmus tests" and why certain code
> sequences are ordered, and others are not.

OK...

> So the code sequence I already mentioned is *not* ordered:
> 
> Litmus test 1:
> 
>         p = atomic_read(pp, consume);
>         if (p == &myvariable)
>                 return p->val;
> 
>    is *NOT* ordered, because the compiler can trivially turn this into
> "return myvariable.val", and break the data dependency.

Right, given your model, the compiler is free to produce code that
doesn't order the load from pp against the load from p->val.

>    This is true *regardless* of any "carries a dependency" language,
> because that language is insane, and doesn't work when the 
