Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote: > On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: > > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: > > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > > +oDo not use the results from the boolean "&&" and "||" > > > > > > > > > when > > > > > > > > > + dereferencing. For example, the following (rather > > > > > > > > > improbable) > > > > > > > > > + code is buggy: > > > > > > > > > + > > > > > > > > > + int a[2]; > > > > > > > > > + int index; > > > > > > > > > + int force_zero_index = 1; > > > > > > > > > + > > > > > > > > > + ... > > > > > > > > > + > > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > > > + > > > > > > > > > + The reason this is buggy is that "&&" and "||" are > > > > > > > > > often compiled > > > > > > > > > + using branches. While weak-memory machines such as ARM > > > > > > > > > or PowerPC > > > > > > > > > + do order stores after such branches, they can speculate > > > > > > > > > loads, > > > > > > > > > + which can result in misordering bugs. 
> > > > > > > > > + > > > > > > > > > +oDo not use the results from relational operators ("==", > > > > > > > > > "!=", > > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For > > > > > > > > > example, > > > > > > > > > + the following (quite strange) code is buggy: > > > > > > > > > + > > > > > > > > > + int a[2]; > > > > > > > > > + int index; > > > > > > > > > + int flip_index = 0; > > > > > > > > > + > > > > > > > > > + ... > > > > > > > > > + > > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > > > + > > > > > > > > > + As before, the reason this is buggy is that relational > > > > > > > > > operators > > > > > > > > > + are often compiled using branches. And as before, > > > > > > > > > although > > > > > > > > > + weak-memory machines such as ARM or PowerPC do order > > > > > > > > > stores > > > > > > > > > + after such branches, but can speculate loads, which can > > > > > > > > > again > > > > > > > > > + result in misordering bugs. > > > > > > > > > > > > > > > > Those two would be allowed by the wording I have recently > > > > > > > > proposed, > > > > > > > > AFAICS. r1 != flip_index would result in two possible values > > > > > > > > (unless > > > > > > > > there are further constraints due to the type of r1 and the > > > > > > > > values that > > > > > > > > flip_index can have). > > > > > > > > > > > > > > And I am OK with the value_dep_preserving type providing > > > > > > > more/better > > > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > > > dependency to be tracked through a comparison operator. What does > > > > > > > the developer do in that case? (The reason I ask is that I have > > > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > > > dependency to be tracked through a comparison.) 
> > > > > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > > > comparison? > > > > > > > > > > That should work well assuming that things like "if", "while", and > > > > > "?:" > > > > > conditions are happy to take a vdp. This assumes that p->a only > > > > > returns > > > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > > > > through the program. ;-) > > > > > > > > > > The other thing that can happen is that a vdp can get handed off to > > > > > another synchronization mechanism, for example, to reference counting: > > > > > > > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > > > > if (do_something_with(p->a)) { > > > > > /* fast path protected by RCU. */ > > > > > return 0; > > > > > } > > > > > if (atomic_inc_not_zero(&p->refcnt)) { > > > > > /* slow path protected by
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote: > xagsmtp5.20140307174618.3...@vmsdvm6.vnet.ibm.com > X-Xagent-Gateway: vmsdvm6.vnet.ibm.com (XAGSMTP5 at VMSDVM6) > > On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: > > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > > > xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC) > > > > > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > +o Do not use the results from the boolean "&&" and "||" > > > > > > > > when > > > > > > > > + dereferencing. For example, the following (rather > > > > > > > > improbable) > > > > > > > > + code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int force_zero_index = 1; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + The reason this is buggy is that "&&" and "||" are > > > > > > > > often compiled > > > > > > > > + using branches. While weak-memory machines such as ARM > > > > > > > > or PowerPC > > > > > > > > + do order stores after such branches, they can speculate > > > > > > > > loads, > > > > > > > > + which can result in misordering bugs. 
> > > > > > > > + > > > > > > > > +o Do not use the results from relational operators ("==", > > > > > > > > "!=", > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For > > > > > > > > example, > > > > > > > > + the following (quite strange) code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int flip_index = 0; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + As before, the reason this is buggy is that relational > > > > > > > > operators > > > > > > > > + are often compiled using branches. And as before, > > > > > > > > although > > > > > > > > + weak-memory machines such as ARM or PowerPC do order > > > > > > > > stores > > > > > > > > + after such branches, but can speculate loads, which can > > > > > > > > again > > > > > > > > + result in misordering bugs. > > > > > > > > > > > > > > Those two would be allowed by the wording I have recently > > > > > > > proposed, > > > > > > > AFAICS. r1 != flip_index would result in two possible values > > > > > > > (unless > > > > > > > there are further constraints due to the type of r1 and the > > > > > > > values that > > > > > > > flip_index can have). > > > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > > dependency to be tracked through a comparison operator. What does > > > > > > the developer do in that case? (The reason I ask is that I have > > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > > dependency to be tracked through a comparison.) > > > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > > comparison? 
> > > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > > conditions are happy to take a vdp. > > > > > > I currently don't see a reason why that should be disallowed. If we > > > have allowed an implicit conversion to non-vdp, I believe that should > > > follow. > > > > I am a bit nervous about a silent implicit conversion from vdp to > > non-vdp in the general case. > > Why are you nervous about it? If someone expects the vdp to propagate into some function that might be compiled with aggressive optimizations that break this expectation, it would be good for that someone to know about it. Ah! I am assuming that the compiler is -not- emitting memory barriers at vdp-to-non-vdp transitions. In that case,
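For the "explicit cast to non-vdp" discussed above, C11 already has a construct that deliberately ends a dependency chain without emitting a barrier: the kill_dependency() macro from <stdatomic.h>. A minimal sketch (gi and the function name are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int gi = 5;

/* Read gi with a dependency-ordered load, then deliberately end the
 * dependency chain.  kill_dependency() returns its argument's value,
 * but the compiler need not preserve dependency ordering past it --
 * and no memory barrier is emitted for the transition. */
int read_and_drop_dependency(void)
{
    int r1 = atomic_load_explicit(&gi, memory_order_consume);
    int r2 = kill_dependency(r1);   /* dependency ends here */
    return r2;
}
```

In the vdp framing, kill_dependency() is the explicit, visible vdp-to-non-vdp conversion, which is exactly why a silent implicit conversion makes some people nervous.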
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > +o Do not use the results from the boolean "&&" and "||" > > > > > > > > when > > > > > > > > + dereferencing. For example, the following (rather > > > > > > > > improbable) > > > > > > > > + code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int force_zero_index = 1; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + The reason this is buggy is that "&&" and "||" are > > > > > > > > often compiled > > > > > > > > + using branches. While weak-memory machines such as ARM > > > > > > > > or PowerPC > > > > > > > > + do order stores after such branches, they can speculate > > > > > > > > loads, > > > > > > > > + which can result in misordering bugs. > > > > > > > > + > > > > > > > > +o Do not use the results from relational operators ("==", > > > > > > > > "!=", > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For > > > > > > > > example, > > > > > > > > + the following (quite strange) code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int flip_index = 0; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + As before, the reason this is buggy is that relational > > > > > > > > operators > > > > > > > > + are often compiled using branches. And as before, > > > > > > > > although > > > > > > > > + weak-memory machines such as ARM or PowerPC do order > > > > > > > > stores > > > > > > > > + after such branches, but can speculate loads, which can > > > > > > > > again > > > > > > > > + result in misordering bugs. > > > > > > > > > > > > > > Those two would be allowed by the wording I have recently > > > > > > > proposed, > > > > > > > AFAICS. r1 != flip_index would result in two possible values > > > > > > > (unless > > > > > > > there are further constraints due to the type of r1 and the > > > > > > > values that > > > > > > > flip_index can have). > > > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > > dependency to be tracked through a comparison operator. What does > > > > > > the developer do in that case? (The reason I ask is that I have > > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > > dependency to be tracked through a comparison.) > > > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > > comparison? > > > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > > conditions are happy to take a vdp. 
This assumes that p->a only returns > > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > > > through the program. ;-) > > > > > > > > The other thing that can happen is that a vdp can get handed off to > > > > another synchronization mechanism, for example, to reference counting: > > > > > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > > > if (do_something_with(p->a)) { > > > > /* fast path protected by RCU. */ > > > > return 0; > > > > } > > > > if (atomic_inc_not_zero(&p->refcnt)) { > > > > /* slow path protected by reference counting. */ > > > > return do_something_else_with((struct foo *)p); /* > > > > CHANGE */ > > > > } > > > > /* Needed slow path, but raced with deletion. */ > > > > return
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > > xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC) > > > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > +oDo not use the results from the boolean "&&" and "||" > > > > > > > when > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > + code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int force_zero_index = 1; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > > > compiled > > > > > > > + using branches. While weak-memory machines such as ARM or > > > > > > > PowerPC > > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > > + which can result in misordering bugs. > > > > > > > + > > > > > > > +oDo not use the results from relational operators ("==", > > > > > > > "!=", > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For example, > > > > > > > + the following (quite strange) code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int flip_index = 0; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > > + are often compiled using branches. And as before, although > > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > > + after such branches, but can speculate loads, which can again > > > > > > > + result in misordering bugs. > > > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > > AFAICS. r1 != flip_index would result in two possible values > > > > > > (unless > > > > > > there are further constraints due to the type of r1 and the values > > > > > > that > > > > > > flip_index can have). > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > dependency to be tracked through a comparison operator. What does > > > > > the developer do in that case? (The reason I ask is that I have > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > dependency to be tracked through a comparison.) > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > comparison? > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > conditions are happy to take a vdp. > > > > I currently don't see a reason why that should be disallowed. If we > > have allowed an implicit conversion to non-vdp, I believe that should > > follow. 
> > I am a bit nervous about a silent implicit conversion from vdp to > non-vdp in the general case. Why are you nervous about it? > However, when the result is being used by > a conditional, the silent implicit conversion makes a lot of sense. > Is that distinction something that the compiler can handle easily? I think so. I'm not a language lawyer, but we have other such conversions in the standard (e.g., int to boolean, between int and float) and I currently don't see a fundamental difference to those. But we'll have to ask the language folks (or SG1 or LEWG) to really verify that. > On the other hand, silent implicit conversion from non-vdp to vdp > is very useful for common code that can be invoked both by RCU > readers and by updaters. I'd be more nervous about that because then there's less obstacles to one programmer expecting a vdp to indicate a dependency vs. another programmer putting non-vdp into vdp. For this case of common code (which I agree is a valid concern), would it be a lot of programmer overhead to add explicit casts from non-vdp to vdp?
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean "&&" and "||" when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that "&&" and "||" are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators ("==", "!=", + ">", ">=", "<", or "<=") when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS.
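The first buggy example quoted above can be made compilable in plain C11 (a sketch: i1 is modeled as an atomic int standing in for the RCU-published value, and the consume load stands in for rcu_dereference()). The rewrite shown alongside is one branch-free alternative, valid only under the stated assumption:

```c
#include <assert.h>
#include <stdatomic.h>

int a[2] = { 10, 20 };
int force_zero_index = 1;
static _Atomic int i1 = 1;   /* stands in for the RCU-published index */

/* The quoted buggy pattern: "&&" usually compiles to a branch, and the
 * branch breaks the value dependency from the consume load, so the
 * array load may be speculated (misordered) on ARM or PowerPC. */
int buggy(void)
{
    int r1 = atomic_load_explicit(&i1, memory_order_consume);
    return a[r1 && force_zero_index];   /* BUGGY: dependency broken */
}

/* One branch-free rewrite: compute the same 0/1 index arithmetically,
 * so the index remains a value dependency of r1.  (This assumes r1 and
 * force_zero_index are each known to be 0 or 1; it is an illustration,
 * not a general transformation.) */
int dependency_preserving(void)
{
    int r1 = atomic_load_explicit(&i1, memory_order_consume);
    return a[r1 & force_zero_index];
}
```

Both functions return the same value; the difference is only whether the index computation can be compiled into a branch that defeats dependency ordering.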
r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. I am a bit nervous about a silent implicit conversion from vdp to non-vdp in the general case. Why are you nervous about it? However, when the result is being used by a conditional, the silent implicit conversion makes a lot of sense. Is that distinction something that the compiler can handle easily? I think so. I'm not a language lawyer, but we have other such conversions in the standard (e.g., int to boolean, between int and float) and I currently don't see a fundamental difference to those. But we'll have to ask the language folks (or SG1 or LEWG) to really verify that. On the other hand, silent implicit conversion from non-vdp to vdp is very useful for common code that can be invoked both by RCU readers and by updaters. I'd be more nervous about that because then there's less obstacles to one programmer expecting a vdp to indicate a dependency vs. another programmer putting non-vdp into vdp. For this case of common code (which I agree is a valid concern), would it be a lot of programmer overhead to add explicit casts from non-vdp to vdp? 
Would C11 generics help with that, similarly to how C++ template functions would? Nonetheless, in the end this is just trading off convenient use against different ways to catch different but simple errors. ?: could be somewhat special, in that the type depends on the 2nd and 3rd operand. Thus, vdp x = non-vdp ? vdp : vdp; should be allowed, whereas vdp x = non-vdp ? non-vdp : vdp; probably should be disallowed if we don't provide for implicit casts from non-vdp to vdp.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean "&&" and "||" when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that "&&" and "||" are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators ("==", "!=", + ">", ">=", "<", or "<=") when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS.
r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like "if", "while", and "?:" conditions are happy to take a vdp. This assumes that p->a only returns vdp if field "a" is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(&gp, memory_order_consume); if (do_something_with(p->a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(&p->refcnt)) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf->mirrors[disk].rdev); if (r1_bio->bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, &rdev->flags) || test_bit(Faulty, &rdev->flags)) continue; The fact that the rdev == NULL returns vdp does not force the || operators to be evaluated arithmetically because the entire function is an if condition, correct? That's a good question, and one that as far as I understand
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote: On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean "&&" and "||" when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that "&&" and "||" are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators ("==", "!=", + ">", ">=", "<", or "<=") when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches.
And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. I am a bit nervous about a silent implicit conversion from vdp to non-vdp in the general case. Why are you nervous about it? If someone expects the vdp to propagate into some function that might be compiled with aggressive optimizations that break this expectation, it would be good for that someone to know about it. Ah! I am assuming that the compiler is -not- emitting memory barriers at vdp-to-non-vdp transitions. In that case, warnings are even more important -- without the warnings, it is a real pain chasing these unnecessary memory barriers out of the code. So we are -not- in the business of emitting memory barriers on vdp-to-non-vdp transitions, right? However, when the result is being used by a conditional, the silent implicit conversion makes a lot of sense. 
Is that distinction something that the compiler can handle easily? I think so. I'm not a language lawyer, but we have other such conversions in the standard (e.g., int to boolean, between int and float) and I currently don't see a fundamental difference to those. But we'll have to ask the
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote: On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean "&&" and "||" when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that "&&" and "||" are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators ("==", "!=", + ">", ">=", "<", or "<=") when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS.
r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. This assumes that p-a only returns vdp if field a is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(gp, memory_order_consume); if (do_something_with(p-a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(p-refcnt) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf-mirrors[disk].rdev); if (r1_bio-bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, rdev-flags) || test_bit(Faulty, rdev-flags)) continue; The fact that the rdev == NULL returns vdp does not force the ||
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC) > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > + code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int force_zero_index = 1; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > > compiled > > > > > > + using branches. While weak-memory machines such as ARM or > > > > > > PowerPC > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > + which can result in misordering bugs. > > > > > > + > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > + the following (quite strange) code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int flip_index = 0; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! 
*/ > > > > > > + > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > + are often compiled using branches. And as before, although > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > + after such branches, but can speculate loads, which can again > > > > > > + result in misordering bugs. > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > there are further constraints due to the type of r1 and the values > > > > > that > > > > > flip_index can have). > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > guarantees than we get by default from current compilers. > > > > > > > > One question, though. Suppose that the code did not want a value > > > > dependency to be tracked through a comparison operator. What does > > > > the developer do in that case? (The reason I ask is that I have > > > > not yet found a use case in the Linux kernel that expects a value > > > > dependency to be tracked through a comparison.) > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > comparison? > > > > That should work well assuming that things like "if", "while", and "?:" > > conditions are happy to take a vdp. > > I currently don't see a reason why that should be disallowed. If we > have allowed an implicit conversion to non-vdp, I believe that should > follow. I am a bit nervous about a silent implicit conversion from vdp to non-vdp in the general case. However, when the result is being used by a conditional, the silent implicit conversion makes a lot of sense. Is that distinction something that the compiler can handle easily? On the other hand, silent implicit conversion from non-vdp to vdp is very useful for common code that can be invoked both by RCU readers and by updaters. 
> ?: could be somewhat special, in that the type depends on the > 2nd and 3rd operand. Thus, "vdp x = non-vdp ? vdp : vdp;" should be > allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be > disallowed if we don't provide for implicit casts from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be able to silently convert from non-vdp to vdp in order to permit common code that is invoked from both RCU readers (vdp) and updaters (often non-vdp). This common code must be compiled conservatively to allow vdp, but should be just fine with non-vdp. Going through the combinations...

0. vdp x = vdp ? vdp : vdp; /* OK, matches. */
1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */
2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */
3. vdp
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote: > On 3 March 2014 20:44, Torvald Riegel wrote: > > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: > >> On 1 March 2014 08:03, Paul E. McKenney wrote: > >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> >> Hi Paul, > >> >> > >> >> On 28 February 2014 18:50, Paul E. McKenney > >> >> wrote: > >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> >> > wrote: > >> >> >> > > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> >> > > where that other pointer was properly fetched using one > >> >> >> > > of the RCU primitives. Here it doesn't matter which > >> >> >> > > pointer > >> >> >> > > you use. At least as long as the rcu_assign_pointer() > >> >> >> > > for > >> >> >> > > that other pointer happened after the last update to the > >> >> >> > > pointed-to structure. > >> >> >> > > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> >> > > >> >> >> > I think that it might be worth pointing out as an example, and > >> >> >> > saying > >> >> >> > that code like > >> >> >> > > >> >> >> >p = atomic_read(consume); > >> >> >> >X; > >> >> >> >q = atomic_read(consume); > >> >> >> >Y; > >> >> >> >if (p == q) > >> >> >> > data = p->val; > >> >> >> > > >> >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> >> > *either* p or q, but you can't really tell which, since the > >> >> >> > compiler > >> >> >> > can decide that the values are interchangeable. > >> >> >> > > >> >> >> > I cannot for the life of me come up with a situation where this > >> >> >> > would > >> >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> >> > stronger ordering than anything the consume through "p" would > >> >> >> > guarantee anyway. 
And if "X" does *not* contain a fence, then the > >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> >> > ordering to the access through "p" is through p or q is kind of > >> >> >> > irrelevant. No? > >> >> >> > >> >> >> I can make a contrived litmus test for it, but you are right, the > >> >> >> only > >> >> >> time you can see it happen is when X has no barriers, in which case > >> >> >> you don't have any ordering anyway -- both the compiler and the CPU > >> >> >> can > >> >> >> reorder the loads into p and q, and the read from p->val can, as you > >> >> >> say, > >> >> >> come from either pointer. > >> >> >> > >> >> >> For whatever it is worth, here is the litmus test: > >> >> >> > >> >> >> T1: p = kmalloc(...); > >> >> >> if (p == NULL) > >> >> >> deal_with_it(); > >> >> >> p->a = 42; /* Each field in its own cache line. */ > >> >> >> p->b = 43; > >> >> >> p->c = 44; > >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> >> p->b = 143; > >> >> >> p->c = 144; > >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> >> > >> >> >> T2: p = atomic_load_explicit(&gp1, memory_order_consume); > >> >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> >> q = atomic_load_explicit(&gp2, memory_order_consume); > >> >> >> if (p == q) { > >> >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> >> r2 = p->c; /* Could get 44 on weakly ordered system. */ > >> >> >> } > >> >> >> > >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get > >> >> >> what > >> >> >> you get. > >> >> >> > >> >> >> And publishing a structure via one RCU-protected pointer, updating > >> >> >> it, > >> >> >> then publishing it via another pointer seems to me to be asking for > >> >> >> trouble anyway. 
If you really want to do something like that and >> >> >> >> still >> >> >> >> see consistency across all the fields in the structure, please put >> >> >> >> a lock >> >> >> >> in the structure and use it to guard updates and accesses to those >> >> >> >> fields. >> >> >> > >> >> >> > And here is a patch documenting the restrictions for the current >> >> >> > Linux >> >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> >> > differently than atomic_load_explicit(&gp, memory_order_consume). >> >> >> > >> >> >> > Thoughts? >> >> >> >> >> >> That might serve as informal documentation for linux kernel >> >> >> programmers about the bounds on the optimisations that you expect >> >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> >> intend it to be for. But I don't see how one can make it precise >> >> >> enough to serve as a language definition, so that compiler people >> >> >> could confidently say "yes, we respect that", which I guess is what >> >> >> you really need. As
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > + code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int force_zero_index = 1; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > > compiled > > > > > > + using branches. While weak-memory machines such as ARM or > > > > > > PowerPC > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > + which can result in misordering bugs. > > > > > > + > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > + the following (quite strange) code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int flip_index = 0; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! 
*/ > > > > > > + > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > + are often compiled using branches. And as before, although > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > + after such branches, but can speculate loads, which can again > > > > > > + result in misordering bugs. > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > there are further constraints due to the type of r1 and the values > > > > > that > > > > > flip_index can have). > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > guarantees than we get by default from current compilers. > > > > > > > > One question, though. Suppose that the code did not want a value > > > > dependency to be tracked through a comparison operator. What does > > > > the developer do in that case? (The reason I ask is that I have > > > > not yet found a use case in the Linux kernel that expects a value > > > > dependency to be tracked through a comparison.) > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > comparison? > > > > That should work well assuming that things like "if", "while", and "?:" > > conditions are happy to take a vdp. This assumes that p->a only returns > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > through the program. ;-) > > > > The other thing that can happen is that a vdp can get handed off to > > another synchronization mechanism, for example, to reference counting: > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > if (do_something_with(p->a)) { > > /* fast path protected by RCU. */ > > return 0; > > } > > if (atomic_inc_not_zero(&p->refcnt)) { > > /* slow path protected by reference counting. 
*/ > > return do_something_else_with((struct foo *)p); /* CHANGE */ > > } > > /* Needed slow path, but raced with deletion. */ > > return -EAGAIN; > > > > I am guessing that the cast ends the vdp. Is that the case? > > And here is a more elaborate example from the Linux kernel: > > struct md_rdev value_dep_preserving *rdev; /* CHANGE */ > > rdev = rcu_dereference(conf->mirrors[disk].rdev); > if (r1_bio->bios[disk] == IO_BLOCKED > || rdev == NULL > || test_bit(Unmerged, &rdev->flags) > || test_bit(Faulty, &rdev->flags)) > continue; > > The fact that the "rdev == NULL" returns vdp does not force the "||" > operators to be evaluated arithmetically because the entire function > is an "if" condition, correct? That's a good question, and one that as far as I understand currently, essentially boils down to whether we want to have tight restrictions on which operations are still vdp. If we look at
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > +oDo not use the results from the boolean "&&" and "||" when > > > > > + dereferencing. For example, the following (rather improbable) > > > > > + code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int force_zero_index = 1; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > + > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > compiled > > > > > + using branches. While weak-memory machines such as ARM or > > > > > PowerPC > > > > > + do order stores after such branches, they can speculate loads, > > > > > + which can result in misordering bugs. > > > > > + > > > > > +oDo not use the results from relational operators ("==", "!=", > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > + the following (quite strange) code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int flip_index = 0; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > + > > > > > + As before, the reason this is buggy is that relational operators > > > > > + are often compiled using branches. 
And as before, although > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > + after such branches, but can speculate loads, which can again > > > > > + result in misordering bugs. > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > there are further constraints due to the type of r1 and the values that > > > > flip_index can have). > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > guarantees than we get by default from current compilers. > > > > > > One question, though. Suppose that the code did not want a value > > > dependency to be tracked through a comparison operator. What does > > > the developer do in that case? (The reason I ask is that I have > > > not yet found a use case in the Linux kernel that expects a value > > > dependency to be tracked through a comparison.) > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > comparison? > > That should work well assuming that things like "if", "while", and "?:" > conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. ?: could be somewhat special, in that the type depends on the 2nd and 3rd operand. Thus, "vdp x = non-vdp ? vdp : vdp;" should be allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be disallowed if we don't provide for implicit casts from non-vdp to vdp. > This assumes that p->a only returns > vdp if field "a" is declared vdp, otherwise we have vdps running wild > through the program. ;-) That's a good question. For the scheme I had in mind, I'm not concerned about vdps running wild because one needs to assign to explicitly vdp-typed variables (or function arguments, etc.) to let vdp extend to beyond single expressions. 
Nonetheless, I think it's a good question how -> should behave if the field is not vdp; in particular, should vdp->non_vdp be automatically vdp? One concern might be that we know something about non-vdp -- OTOH, we shouldn't be able to do so because we (assume to) don't know anything about the vdp pointer, so we can't infer something about something it points to. > The other thing that can happen is that a vdp can get handed off to > another synchronization mechanism, for example, to reference counting: > > p = atomic_load_explicit(&gp, memory_order_consume); > if (do_something_with(p->a)) { > /* fast path protected by RCU. */ > return 0; > } > if (atomic_inc_not_zero(&p->refcnt)) { Is the argument to atomic_inc_not_zero vdp or non-vdp? > /* slow path protected by reference counting. */ > return do_something_else_with((struct foo *)p); /*
/* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? That would end it, yes. The other way this could happen is that the argument of do_something_else_with() would be specified to be non-vdp. -- To unsubscribe from this
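Paul's patch text quoted earlier in this thread warns against using the result of a comparison on an rcu_dereference()'d value as an array index.  A minimal user-space sketch of the two patterns (function and variable names are mine, an ordinary parameter stands in for the rcu_dereference() result, and the misordering itself can of course only appear with concurrent execution on a weakly ordered machine):

```c
static int a[2] = { 10, 20 };

/* The pattern the patch flags as BUGGY: "r1 != flip_index" is usually
 * compiled to a branch, so the load from a[] carries only a control
 * dependency rather than a data dependency on r1, and ARM/PowerPC may
 * speculate it ahead of the load that produced r1. */
static int buggy_read(int r1, int flip_index)
{
	return a[r1 != flip_index];
}

/* Using the value itself as the index keeps the address of the load
 * data-dependent on r1 (the caller must pass 0 or 1 here). */
static int dep_read(int r1)
{
	return a[r1];
}
```

Single-threaded, both functions return the same values; the difference only matters for memory ordering on a weakly ordered machine.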
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > [ . . . patch text and discussion of casts, "?:", and the
> >   reference-counting handoff . . . ]
>
> And here is a more elaborate example from the Linux kernel:
>
> 	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
>
> 	rdev = rcu_dereference(conf->mirrors[disk].rdev);
> 	if (r1_bio->bios[disk] == IO_BLOCKED
> 	    || rdev == NULL
> 	    || test_bit(Unmerged, &rdev->flags)
> 	    || test_bit(Faulty, &rdev->flags))
> 		continue;
>
> The fact that the "rdev == NULL" returns vdp does not force the "||"
> operators to be evaluated arithmetically because the entire function
> is an "if" condition, correct?

That's a good question, and one that, as far as I understand currently,
essentially boils down to whether we want to have tight restrictions on
which operations are still vdp.

If we look at the different combinations, then it seems we can't decide
on whether we have a value dependency just due to a vdp type:
* non-vdp || vdp: vdp iff non-vdp == false
* vdp || non-vdp: vdp iff non-vdp == false?
* vdp || vdp: always vdp?  (and dependency on both?)
I'm not sure it makes sense to try to not make all of those
vdp-by-default.  The first and second case show that it's dependent on
the specific execution anyway, and thus is
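Torvald's point that `non-vdp || vdp` is a value dependency only on some executions follows from "||" being a short-circuit operator, which compilers typically implement with a branch.  A small user-space illustration (names are mine) of the difference between short-circuit "||" and its bitwise cousin "|", which evaluates both operands unconditionally and is therefore the kind of "arithmetic" evaluation Paul asks about:

```c
static int calls;  /* counts evaluations of side() */

static int side(int v)
{
	calls++;
	return v;
}

/* "||" short-circuits: when the left operand is nonzero, the right one
 * is never evaluated -- i.e., there is a branch, and any dependency on
 * the right-hand side exists only on some executions. */
static int short_circuit(int lhs)
{
	calls = 0;
	return lhs || side(1);
}

/* "|" evaluates both operands on every execution, so a dependency on
 * either side is carried unconditionally. */
static int arithmetic(int lhs)
{
	calls = 0;
	return lhs | side(1);
}
```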
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote:
> On 3 March 2014 20:44, Torvald Riegel trie...@redhat.com wrote:
> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> > > On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
> > > > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> > > > > Hi Paul,
> > > > >
> > > > > [ . . . Linus's "p == q" example, Paul's litmus test, and the
> > > > >   announcement of the documentation patch . . . ]
> > > > >
> > > > > That might serve as informal documentation for linux kernel
> > > > > programmers about the bounds on the optimisations that you expect
> > > > > compilers to do for common-case RCU code - and I guess that's
> > > > > what you intend it to be for.  But I don't see how one can make
> > > > > it precise enough to serve as a language definition, so that
> > > > > compiler people could confidently say "yes, we respect that",
> > > > > which I guess is what you really need.  As a useful criterion,
> > > > > we should aim for something precise enough that in a
> > > > > verified-compiler context you can mathematically prove that the
> > > > > compiler will satisfy it (even though that won't happen anytime
> > > > > soon for GCC), and that analysis tool authors can actually know
> > > > > what they're working with.  All this stuff about "you should
> > > > > avoid cancellation", and "avoid masking with just a small number
> > > > > of bits" is just too vague.
> > > >
> > > > Understood, and yes, this is intended to document current compiler
> > > > behavior for the Linux kernel community.  It would not make sense
> > > > to show it to the C11 or C++11 communities, except perhaps as an
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > [ . . . patch text and earlier discussion of casting away a value
> >   dependency . . . ]
> >
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> >
> > That should work well assuming that things like "if", "while", and
> > "?:" conditions are happy to take a vdp.
>
> I currently don't see a reason why that should be disallowed.  If we
> have allowed an implicit conversion to non-vdp, I believe that should
> follow.

I am a bit nervous about a silent implicit conversion from vdp to
non-vdp in the general case.  However, when the result is being used by
a conditional, the silent implicit conversion makes a lot of sense.
Is that distinction something that the compiler can handle easily?

On the other hand, silent implicit conversion from non-vdp to vdp is
very useful for common code that can be invoked both by RCU readers and
by updaters.

> "?:" could be somewhat special, in that the type depends on the 2nd and
> 3rd operand.  Thus,
> 	vdp x = non-vdp ? vdp : vdp;
> should be allowed, whereas
> 	vdp x = non-vdp ? non-vdp : vdp;
> probably should be disallowed if we don't provide for implicit casts
> from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be
able to silently convert from non-vdp to vdp in order to permit common
code that is invoked from both RCU readers (vdp) and updaters (often
non-vdp).  This common code must be compiled conservatively to allow
vdp, but should be just fine with non-vdp.

Going through the combinations...

0. vdp x = vdp ? vdp : vdp;             /* OK, matches. */
1. vdp x = vdp ? vdp : non-vdp;         /* Silent conversion. */
2. vdp x = vdp ? non-vdp : vdp;         /* Silent conversion. */
3. vdp x = vdp ? non-vdp : non-vdp;     /* Silent conversion. */
4. vdp x = non-vdp ? vdp : vdp;         /* OK, matches. */
5. vdp x = non-vdp ? vdp : non-vdp;     /* Silent conversion. */
6. vdp x = non-vdp ? non-vdp : vdp;     /* Silent conversion. */
7. vdp x = non-vdp ? non-vdp : non-vdp; /* Silent conversion. */
8. non-vdp x = vdp ? vdp : vdp;
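Paul's list of "?:" combinations above can be made concrete with a small sketch.  No compiler implements the proposed value_dep_preserving qualifier, so it is defined away below purely so the declarations compile; the comments mark which of Paul's numbered cases each declaration corresponds to, and all names are mine:

```c
/* "value_dep_preserving" is only a proposal; no compiler implements it,
 * so define it to nothing just to make the code below compile. */
#define value_dep_preserving

struct foo { int x; };

static struct foo f1 = { 1 }, f2 = { 2 };

static int pick(int cond)
{
	/* Stands in for an rcu_dereference() result. */
	struct foo value_dep_preserving *vdp_a = &f1;
	struct foo *non_vdp = &f2;

	/* Case 4 in Paul's list: non-vdp condition, both arms vdp --
	 * the types match, so the result stays vdp. */
	struct foo value_dep_preserving *p = cond ? vdp_a : vdp_a;

	/* Case 6: non-vdp ? non-vdp : vdp -- under the proposal this
	 * needs the silent non-vdp -> vdp conversion that Paul argues
	 * is required for code shared by readers and updaters. */
	struct foo value_dep_preserving *q = cond ? non_vdp : vdp_a;

	return p->x + q->x;
}
```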
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > [ . . . patch text, the reference-counting handoff, and the md_rdev
> >   example . . . ]
> >
> > The fact that the "rdev == NULL" returns vdp does not force the "||"
> > operators to be evaluated arithmetically because the entire function
> > is an "if" condition, correct?
>
> That's a good question, and one that, as far as I understand currently,
> essentially boils down to whether we want to have tight restrictions on
> which operations are still vdp.
>
> If we look at the different combinations, then it seems we can't decide
> on whether we have a value dependency just due to a vdp type:
> * non-vdp || vdp: vdp iff non-vdp == false
> * vdp || non-vdp: vdp iff non-vdp == false?
> * vdp || vdp: always vdp?  (and dependency on both?)
> I'm not sure it makes sense to try to not make
Re: [RFC][PATCH 0/5] arch: atomic rework
On 5 March 2014 17:15, Torvald Riegel trie...@redhat.com wrote:
> On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote:
> > On 3 March 2014 20:44, Torvald Riegel trie...@redhat.com wrote:
> > > [ . . . Linus's "p == q" example, Paul's litmus test, and the
> > >   discussion of whether the documentation is precise enough to
> > >   serve as a language definition . . . ]
> >
> > Understood, and yes, this is intended to document current compiler
> > behavior for the Linux kernel community.  It would not make
Re: [RFC][PATCH 0/5] arch: atomic rework
On 3 March 2014 20:44, Torvald Riegel wrote: > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: >> On 1 March 2014 08:03, Paul E. McKenney wrote: >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> >> Hi Paul, >> >> >> >> On 28 February 2014 18:50, Paul E. McKenney >> >> wrote: >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> >> > wrote: >> >> >> > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> >> > > where that other pointer was properly fetched using one >> >> >> > > of the RCU primitives. Here it doesn't matter which >> >> >> > > pointer >> >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> >> > > that other pointer happened after the last update to the >> >> >> > > pointed-to structure. >> >> >> > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? >> >> >> > >> >> >> > I think that it might be worth pointing out as an example, and saying >> >> >> > that code like >> >> >> > >> >> >> >p = atomic_read(consume); >> >> >> >X; >> >> >> >q = atomic_read(consume); >> >> >> >Y; >> >> >> >if (p == q) >> >> >> > data = p->val; >> >> >> > >> >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> >> > can decide that the values are interchangeable. >> >> >> > >> >> >> > I cannot for the life of me come up with a situation where this would >> >> >> > matter, though. If "X" contains a fence, then that fence will be a >> >> >> > stronger ordering than anything the consume through "p" would >> >> >> > guarantee anyway. 
>> >> >> > And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you
>> >> >> say, come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:	p = kmalloc(...);
>> >> >> 	if (p == NULL)
>> >> >> 		deal_with_it();
>> >> >> 	p->a = 42;  /* Each field in its own cache line. */
>> >> >> 	p->b = 43;
>> >> >> 	p->c = 44;
>> >> >> 	atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >> 	p->b = 143;
>> >> >> 	p->c = 144;
>> >> >> 	atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >> 	r1 = p->b;  /* Guaranteed to get 143. */
>> >> >> 	q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >> 	if (p == q) {
>> >> >> 		/* The compiler decides that q->c is same as p->c. */
>> >> >> 		r2 = p->c;  /* Could get 44 on weakly ordered system. */
>> >> >> 	}
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a
>> >> >> lock in the structure and use it to guard updates and accesses to
>> >> >> those fields.
>> >> > >> >> > And here is a patch documenting the restrictions for the current Linux >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> > differently than atomic_load_explicit(, memory_order_consume). >> >> > >> >> > Thoughts? >> >> >> >> That might serve as informal documentation for linux kernel >> >> programmers about the bounds on the optimisations that you expect >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> intend it to be for. But I don't see how one can make it precise >> >> enough to serve as a language definition, so that compiler people >> >> could confidently say "yes, we respect that", which I guess is what >> >> you really need. As a useful criterion, we should aim for something >> >> precise enough that in a verified-compiler context you can >> >> mathematically prove that the compiler will satisfy it (even though >> >> that won't happen anytime soon for GCC), and that analysis tool >> >> authors can actually know what they're working with. All this stuff >> >> about "you should avoid
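Paul's litmus test quoted above can be turned into a compilable user-space sketch with C11 atomics.  The "&gp1"/"&gp2" arguments eaten by the archive's HTML stripping are restored here, kmalloc() is replaced by malloc(), and the two threads are run sequentially purely for illustration; the interesting r2 == 44 outcome can only occur with genuinely concurrent execution on a weakly ordered machine:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct foo { int a, b, c; };

static _Atomic(struct foo *) gp1, gp2;

/* T1 from the litmus test: publish via gp1, update, publish via gp2. */
static void t1(void)
{
	struct foo *p = malloc(sizeof(*p));
	if (p == NULL)
		abort();
	p->a = 42;   /* Each field in its own cache line, in the original. */
	p->b = 43;
	p->c = 44;
	atomic_store_explicit(&gp1, p, memory_order_release);
	p->b = 143;
	p->c = 144;
	atomic_store_explicit(&gp2, p, memory_order_release);
}

/* T2: run after t1() here, so it deterministically sees the final
 * values; concurrently, r2 could instead be 44 on ARM/PowerPC. */
static int t2(void)
{
	struct foo *p = atomic_load_explicit(&gp2, memory_order_consume);
	int r1 = p->b;          /* 143 in this sequential run. */
	struct foo *q = atomic_load_explicit(&gp1, memory_order_consume);
	int r2 = 0;
	if (p == q)
		r2 = p->c;      /* 144 here; the race could yield 44. */
	return r1 + r2;
}

/* Sequential driver: 143 + 144 = 287. */
static int run_sequentially(void)
{
	t1();
	return t2();
}
```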
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > +oDo not use the results from the boolean "&&" and "||" when > > > > > + dereferencing. For example, the following (rather improbable) > > > > > + code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int force_zero_index = 1; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > + > > > > > + The reason this is buggy is that "&&" and "||" are often > > > > > compiled > > > > > + using branches. While weak-memory machines such as ARM or > > > > > PowerPC > > > > > + do order stores after such branches, they can speculate loads, > > > > > + which can result in misordering bugs. > > > > > + > > > > > +oDo not use the results from relational operators ("==", "!=", > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > + the following (quite strange) code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int flip_index = 0; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > + > > > > > + As before, the reason this is buggy is that relational operators > > > > > + are often compiled using branches. 
> > > > > + And as before, although weak-memory machines such as ARM or
> > > > > + PowerPC do order stores after such branches, but can speculate
> > > > > + loads, which can again result in misordering bugs.
> > > >
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values
> > > > that flip_index can have).
> > >
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > >
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> >
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
>
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.  This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)
>
> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
>
> 	p = atomic_load_explicit(&gp, memory_order_consume);
> 	if (do_something_with(p->a)) {
> 		/* fast path protected by RCU. */
> 		return 0;
> 	}
> 	if (atomic_inc_not_zero(&p->refcnt)) {
> 		/* slow path protected by reference counting. */
> 		return do_something_else_with((struct foo *)p); /* CHANGE */
> 	}
> 	/* Needed slow path, but raced with deletion. */
> 	return -EAGAIN;
>
> I am guessing that the cast ends the vdp.  Is that the case?
And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf->mirrors[disk].rdev); if (r1_bio->bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, &rdev->flags) || test_bit(Faulty, &rdev->flags)) continue; The fact that the "rdev == NULL" returns vdp does not force the "||" operators to be evaluated arithmetically because the entire function is an "if" condition, correct? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > + dereferencing. For example, the following (rather improbable) > > > > + code is buggy: > > > > + > > > > + int a[2]; > > > > + int index; > > > > + int force_zero_index = 1; > > > > + > > > > + ... > > > > + > > > > + r1 = rcu_dereference(i1) > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > + > > > > + The reason this is buggy is that "&&" and "||" are often > > > > compiled > > > > + using branches. While weak-memory machines such as ARM or > > > > PowerPC > > > > + do order stores after such branches, they can speculate loads, > > > > + which can result in misordering bugs. > > > > + > > > > +o Do not use the results from relational operators ("==", "!=", > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > + the following (quite strange) code is buggy: > > > > + > > > > + int a[2]; > > > > + int index; > > > > + int flip_index = 0; > > > > + > > > > + ... > > > > + > > > > + r1 = rcu_dereference(i1) > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > + > > > > + As before, the reason this is buggy is that relational operators > > > > + are often compiled using branches. And as before, although > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > + after such branches, but can speculate loads, which can again > > > > + result in misordering bugs. 
> > > > > > Those two would be allowed by the wording I have recently proposed, > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > there are further constraints due to the type of r1 and the values that > > > flip_index can have). > > > > And I am OK with the value_dep_preserving type providing more/better > > guarantees than we get by default from current compilers. > > > > One question, though. Suppose that the code did not want a value > > dependency to be tracked through a comparison operator. What does > > the developer do in that case? (The reason I ask is that I have > > not yet found a use case in the Linux kernel that expects a value > > dependency to be tracked through a comparison.) > > Hmm. I suppose use an explicit cast to non-vdp before or after the > comparison? That should work well assuming that things like "if", "while", and "?:" conditions are happy to take a vdp. This assumes that p->a only returns vdp if field "a" is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(&gp, memory_order_consume); if (do_something_with(p->a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(&p->refcnt)) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case?
Re: [RFC][PATCH 0/5] arch: atomic rework
On 3 March 2014 20:44, Torvald Riegel trie...@redhat.com wrote: On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: Hi Paul, On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of "p->val" is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If "X" contains a fence, then that fence will be a stronger ordering than anything the consume through "p" would guarantee anyway. And if "X" does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through "p" is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer.
For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. */ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(&p, memory_order_consume). Thoughts? That might serve as informal documentation for Linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say "yes, we respect that", which I guess is what you really need. As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with.
All this stuff about "you should avoid cancellation", and "avoid masking with just a small number of bits" is just too vague. Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities, except perhaps as an informational piece on current practice. The basic problem is that the compiler may be doing sophisticated reasoning with a bunch of non-local knowledge that it's deduced from the code, neither of which are
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > +oDo not use the results from the boolean "&&" and "||" when > > > + dereferencing. For example, the following (rather improbable) > > > + code is buggy: > > > + > > > + int a[2]; > > > + int index; > > > + int force_zero_index = 1; > > > + > > > + ... > > > + > > > + r1 = rcu_dereference(i1) > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > + > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > + do order stores after such branches, they can speculate loads, > > > + which can result in misordering bugs. > > > + > > > +oDo not use the results from relational operators ("==", "!=", > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > + the following (quite strange) code is buggy: > > > + > > > + int a[2]; > > > + int index; > > > + int flip_index = 0; > > > + > > > + ... > > > + > > > + r1 = rcu_dereference(i1) > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > + > > > + As before, the reason this is buggy is that relational operators > > > + are often compiled using branches. And as before, although > > > + weak-memory machines such as ARM or PowerPC do order stores > > > + after such branches, but can speculate loads, which can again > > > + result in misordering bugs. > > > > Those two would be allowed by the wording I have recently proposed, > > AFAICS. r1 != flip_index would result in two possible values (unless > > there are further constraints due to the type of r1 and the values that > > flip_index can have). 
> > And I am OK with the value_dep_preserving type providing more/better > guarantees than we get by default from current compilers. > > One question, though. Suppose that the code did not want a value > dependency to be tracked through a comparison operator. What does > the developer do in that case? (The reason I ask is that I have > not yet found a use case in the Linux kernel that expects a value > dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison?
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: > On 1 March 2014 08:03, Paul E. McKenney wrote: > > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> Hi Paul, > >> > >> On 28 February 2014 18:50, Paul E. McKenney > >> wrote: > >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> > wrote: > >> >> > > > >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> > > where that other pointer was properly fetched using one > >> >> > > of the RCU primitives. Here it doesn't matter which pointer > >> >> > > you use. At least as long as the rcu_assign_pointer() for > >> >> > > that other pointer happened after the last update to the > >> >> > > pointed-to structure. > >> >> > > > >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> > > >> >> > I think that it might be worth pointing out as an example, and saying > >> >> > that code like > >> >> > > >> >> >p = atomic_read(consume); > >> >> >X; > >> >> >q = atomic_read(consume); > >> >> >Y; > >> >> >if (p == q) > >> >> > data = p->val; > >> >> > > >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> > *either* p or q, but you can't really tell which, since the compiler > >> >> > can decide that the values are interchangeable. > >> >> > > >> >> > I cannot for the life of me come up with a situation where this would > >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> > stronger ordering than anything the consume through "p" would > >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the > >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> > ordering to the access through "p" is through p or q is kind of > >> >> > irrelevant. No? 
> >> >> > >> >> I can make a contrived litmus test for it, but you are right, the only > >> >> time you can see it happen is when X has no barriers, in which case > >> >> you don't have any ordering anyway -- both the compiler and the CPU can > >> >> reorder the loads into p and q, and the read from p->val can, as you > >> >> say, > >> >> come from either pointer. > >> >> > >> >> For whatever it is worth, hear is the litmus test: > >> >> > >> >> T1: p = kmalloc(...); > >> >> if (p == NULL) > >> >> deal_with_it(); > >> >> p->a = 42; /* Each field in its own cache line. */ > >> >> p->b = 43; > >> >> p->c = 44; > >> >> atomic_store_explicit(, p, memory_order_release); > >> >> p->b = 143; > >> >> p->c = 144; > >> >> atomic_store_explicit(, p, memory_order_release); > >> >> > >> >> T2: p = atomic_load_explicit(, memory_order_consume); > >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> q = atomic_load_explicit(, memory_order_consume); > >> >> if (p == q) { > >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> r2 = p->c; /* Could get 44 on weakly order system. */ > >> >> } > >> >> > >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what > >> >> you get. > >> >> > >> >> And publishing a structure via one RCU-protected pointer, updating it, > >> >> then publishing it via another pointer seems to me to be asking for > >> >> trouble anyway. If you really want to do something like that and still > >> >> see consistency across all the fields in the structure, please put a > >> >> lock > >> >> in the structure and use it to guard updates and accesses to those > >> >> fields. > >> > > >> > And here is a patch documenting the restrictions for the current Linux > >> > kernel. The rules change a bit due to rcu_dereference() acting a bit > >> > differently than atomic_load_explicit(, memory_order_consume). > >> > > >> > Thoughts? 
> >> > >> That might serve as informal documentation for linux kernel > >> programmers about the bounds on the optimisations that you expect > >> compilers to do for common-case RCU code - and I guess that's what you > >> intend it to be for. But I don't see how one can make it precise > >> enough to serve as a language definition, so that compiler people > >> could confidently say "yes, we respect that", which I guess is what > >> you really need. As a useful criterion, we should aim for something > >> precise enough that in a verified-compiler context you can > >> mathematically prove that the compiler will satisfy it (even though > >> that won't happen anytime soon for GCC), and that analysis tool > >> authors can actually know what they're working with. All this stuff > >> about "you should avoid cancellation", and "avoid masking with just a > >> small number of bits" is just too vague. > > > > Understood, and yes, this is intended to document current
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > +o Do not use the results from the boolean "&&" and "||" when > > + dereferencing. For example, the following (rather improbable) > > + code is buggy: > > + > > + int a[2]; > > + int index; > > + int force_zero_index = 1; > > + > > + ... > > + > > + r1 = rcu_dereference(i1) > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > + > > + The reason this is buggy is that "&&" and "||" are often compiled > > + using branches. While weak-memory machines such as ARM or PowerPC > > + do order stores after such branches, they can speculate loads, > > + which can result in misordering bugs. > > + > > +o Do not use the results from relational operators ("==", "!=", > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > + the following (quite strange) code is buggy: > > + > > + int a[2]; > > + int index; > > + int flip_index = 0; > > + > > + ... > > + > > + r1 = rcu_dereference(i1) > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > + > > + As before, the reason this is buggy is that relational operators > > + are often compiled using branches. And as before, although > > + weak-memory machines such as ARM or PowerPC do order stores > > + after such branches, but can speculate loads, which can again > > + result in misordering bugs. > > Those two would be allowed by the wording I have recently proposed, > AFAICS. r1 != flip_index would result in two possible values (unless > there are further constraints due to the type of r1 and the values that > flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. 
What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) > I don't think the wording is flawed. We could raise the requirement of > having more than one value left for r1 to having more than N with N > 1 > values left, but the fundamental problem remains in that a compiler > could try to generate a (big) switch statement. > > Instead, I think that this indicates that the value_dep_preserving type > modifier would be useful: It would tell the compiler that it shouldn't > transform this into a branch in this case, yet allow that optimization > for all other code. Understood! BTW, my current task is generating examples using the value_dep_preserving type for RCU-protected array indexes. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > +oDo not use the results from the boolean "&&" and "||" when > + dereferencing. For example, the following (rather improbable) > + code is buggy: > + > + int a[2]; > + int index; > + int force_zero_index = 1; > + > + ... > + > + r1 = rcu_dereference(i1) > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > + > + The reason this is buggy is that "&&" and "||" are often compiled > + using branches. While weak-memory machines such as ARM or PowerPC > + do order stores after such branches, they can speculate loads, > + which can result in misordering bugs. > + > +oDo not use the results from relational operators ("==", "!=", > + ">", ">=", "<", or "<=") when dereferencing. For example, > + the following (quite strange) code is buggy: > + > + int a[2]; > + int index; > + int flip_index = 0; > + > + ... > + > + r1 = rcu_dereference(i1) > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > + > + As before, the reason this is buggy is that relational operators > + are often compiled using branches. And as before, although > + weak-memory machines such as ARM or PowerPC do order stores > + after such branches, but can speculate loads, which can again > + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). I don't think the wording is flawed. We could raise the requirement of having more than one value left for r1 to having more than N with N > 1 values left, but the fundamental problem remains in that a compiler could try to generate a (big) switch statement. Instead, I think that this indicates that the value_dep_preserving type modifier would be useful: It would tell the compiler that it shouldn't transform this into a branch in this case, yet allow that optimization for all other code. 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote: > Your proposal looks quite promising at first glance. But rather than > try and comment on it immediately, I am going to take a number of uses of > RCU from the Linux kernel and apply your proposal to them, then respond > with the results. > > Fair enough? Sure. Thanks for doing the cross-check!
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > wrote: > > > > 3. The comparison was against another RCU-protected pointer, > > where that other pointer was properly fetched using one > > of the RCU primitives. Here it doesn't matter which pointer > > you use. At least as long as the rcu_assign_pointer() for > > that other pointer happened after the last update to the > > pointed-to structure. > > > > I am a bit nervous about #3. Any thoughts on it? > > I think that it might be worth pointing out as an example, and saying > that code like > >p = atomic_read(consume); >X; >q = atomic_read(consume); >Y; >if (p == q) > data = p->val; > > then the access of "p->val" is constrained to be data-dependent on > *either* p or q, but you can't really tell which, since the compiler > can decide that the values are interchangeable. The wording I proposed would make the p dereference have a value dependency unless X and Y would somehow restrict p and q. The reasoning is that if the atomic loads return potentially more than one value, then even if we find out that two such loads did return the same value, we still don't know what the exact value was.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel wrote: > > Regarding the latter, we make a fresh start at each mo_consume load (ie, > > we assume we know nothing -- L could have returned any possible value); > > I believe this is easier to reason about than other scopes like function > > granularities (what happens on inlining?), or translation units. It > > should also be simple to implement for compilers, and would hopefully > > not constrain optimization too much. > > > > [...] > > > > Paul's litmus test would work, because we guarantee to the programmer > > that it can assume that the mo_consume load would return any value > > allowed by the type; effectively, this forbids the compiler analysis > > Paul thought about: > > So realistically, since with the new wording we can ignore the silly > cases (ie "p-p") and we can ignore the trivial-to-optimize compiler > cases ("if (p == ) .. use p"), and you would forbid the > "global value range optimization case" that Paul bright up, what > remains would seem to be just really subtle compiler transformations > of data dependencies to control dependencies. > > And the only such thing I can think of is basically compiler-initiated > value-prediction, presumably directed by PGO (since now if the value > prediction is in the source code, it's considered to break the value > chain). The other example that comes to mind would be feedback-directed JIT compilation. I don't think that's widely used today, and it might never be for the kernel -- but *in the standard*, we at least have to consider what the future might bring. > The good thing is that afaik, value-prediction is largely not used in > real life, afaik. There are lots of papers on it, but I don't think > anybody actually does it (although I can easily see some > specint-specific optimization pattern that is build up around it). 
> > And even value prediction is actually fine, as long as the compiler > can see the memory *source* of the value prediction (and it isn't a > mo_consume). So it really ends up limiting your value prediction in > very simple ways: you cannot do it to function arguments if they are > registers. But you can still do value prediction on values you loaded > from memory, if you can actually *see* that memory op. I think one would need to show that the source is *not even indirectly* a mo_consume load. With the wording I proposed, value dependencies don't break when storing to / loading from memory locations. Thus, if a compiler ends up at a memory load after walking SSA, it needs to prove that the load cannot read a value that (1) was produced by a store sequenced-before the load and (2) might carry a value dependency (e.g., by being a mo_consume load) that the value prediction in question would break. This, in general, requires alias analysis. Deciding whether a prediction would break a value dependency has to consider what later stages in a compiler would be doing, including LTO or further rounds of inlining/optimizations. OTOH, if the compiler can treat an mo_consume load as returning all possible values (eg, by ignoring all knowledge about it), then it can certainly do so with other memory loads too. So, I think that the constraints due to value dependencies can matter in practice. However, the impact on optimizations of non-mo_consume-related code is hard to estimate -- I don't see a huge amount of impact right now, but I also wouldn't want to predict that this can't change in the future. > Of course, on more strongly ordered CPU's, even that "register > argument" limitation goes away. > > So I agree that there is basically no real optimization constraint. > Value-prediction is of dubious value to begin with, and the actual > constraint on its use if some compiler writer really wants to is not > onerous. 
> > > What I have in mind is roughly the following (totally made-up syntax -- > > suggestions for how to do this properly are very welcome): > > * Have a type modifier (eg, like restrict), that specifies that > > operations on data of this type are preserving value dependencies: > > So I'm not violently opposed, but I think the upsides are not great. > Note that my earlier suggestion to use "restrict" wasn't because I > believed the annotation itself would be visible, but basically just as > a legalistic promise to the compiler that *if* it found an alias, then > it didn't need to worry about ordering. So to me, that type modifier > was about conceptual guarantees, not about actual value chains. > > Anyway, the reason I don't believe any type modifier (and > "[[carries_dependency]]" is basically just that) is worth it is simply > that it adds a real burden on the programmer, without actually giving > the programmer any real upside: > > Within a single function, the compiler already sees that mo_consume > source, and so doing a type-based restriction doesn't really help. The > information is already there, without any burden on the programmer.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote: > Your proposal looks quite promising at first glance. But rather than > try and comment on it immediately, I am going to take a number of uses > of RCU from the Linux kernel and apply your proposal to them, then > respond with the results. Fair enough? Sure. Thanks for doing the cross-check!
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> +o	Do not use the results from the boolean "&&" and "||" when
> +	dereferencing.  For example, the following (rather improbable)
> +	code is buggy:
> +
> +		int a[2];
> +		int index;
> +		int force_zero_index = 1;
> +
> +		...
> +
> +		r1 = rcu_dereference(i1)
> +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> +	The reason this is buggy is that "&&" and "||" are often compiled
> +	using branches.  While weak-memory machines such as ARM or PowerPC
> +	do order stores after such branches, they can speculate loads,
> +	which can result in misordering bugs.
> +
> +o	Do not use the results from relational operators ("==", "!=",
> +	">", ">=", "<", or "<=") when dereferencing.  For example,
> +	the following (quite strange) code is buggy:
> +
> +		int a[2];
> +		int index;
> +		int flip_index = 0;
> +
> +		...
> +
> +		r1 = rcu_dereference(i1)
> +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> +	As before, the reason this is buggy is that relational operators
> +	are often compiled using branches.  And as before, although
> +	weak-memory machines such as ARM or PowerPC do order stores
> +	after such branches, but can speculate loads, which can again
> +	result in misordering bugs.

Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have).

I don't think the wording is flawed. We could raise the requirement of having more than one value left for r1 to having more than N with N > 1 values left, but the fundamental problem remains in that a compiler could try to generate a (big) switch statement.

Instead, I think that this indicates that the value_dep_preserving type modifier would be useful: It would tell the compiler that it shouldn't transform this into a branch in this case, yet allow that optimization for all other code. 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > +o	Do not use the results from the boolean "&&" and "||" when
> > +	dereferencing.  For example, the following (rather improbable)
> > +	code is buggy:
> > +
> > +		int a[2];
> > +		int index;
> > +		int force_zero_index = 1;
> > +
> > +		...
> > +
> > +		r1 = rcu_dereference(i1)
> > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > +
> > +	The reason this is buggy is that "&&" and "||" are often compiled
> > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > +	do order stores after such branches, they can speculate loads,
> > +	which can result in misordering bugs.
> > +
> > +o	Do not use the results from relational operators ("==", "!=",
> > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > +	the following (quite strange) code is buggy:
> > +
> > +		int a[2];
> > +		int index;
> > +		int flip_index = 0;
> > +
> > +		...
> > +
> > +		r1 = rcu_dereference(i1)
> > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > +
> > +	As before, the reason this is buggy is that relational operators
> > +	are often compiled using branches.  And as before, although
> > +	weak-memory machines such as ARM or PowerPC do order stores
> > +	after such branches, but can speculate loads, which can again
> > +	result in misordering bugs.
>
> Those two would be allowed by the wording I have recently proposed,
> AFAICS.  r1 != flip_index would result in two possible values (unless
> there are further constraints due to the type of r1 and the values that
> flip_index can have).

And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers.

One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.)

> I don't think the wording is flawed.  We could raise the requirement of
> having more than one value left for r1 to having more than N with N > 1
> values left, but the fundamental problem remains in that a compiler
> could try to generate a (big) switch statement.
>
> Instead, I think that this indicates that the value_dep_preserving type
> modifier would be useful: It would tell the compiler that it shouldn't
> transform this into a branch in this case, yet allow that optimization
> for all other code.

Understood! BTW, my current task is generating examples using the value_dep_preserving type for RCU-protected array indexes.

							Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: Hi Paul, On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like

	p = atomic_read(consume);
	X;
	q = atomic_read(consume);
	Y;
	if (p == q)
		data = p->val;

then the access of "p->val" is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If "X" contains a fence, then that fence will be a stronger ordering than anything the consume through "p" would guarantee anyway. And if "X" does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through "p" is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer.

For whatever it is worth, here is the litmus test:

	T1:	p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		p->a = 42;  /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		atomic_store_explicit(&gp1, p, memory_order_release);
		p->b = 143;
		p->c = 144;
		atomic_store_explicit(&gp2, p, memory_order_release);

	T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
		r1 = p->b;  /* Guaranteed to get 143. */
		q = atomic_load_explicit(&gp1, memory_order_consume);
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c;  /* Could get 44 on weakly ordered system. */
		}

The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields.

And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(p, memory_order_consume). Thoughts?

That might serve as informal documentation for linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say "yes, we respect that", which I guess is what you really need. As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with. All this stuff about "you should avoid cancellation", and "avoid masking with just a small number of bits" is just too vague.

Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities, except perhaps as an informational piece on current practice. The basic problem is that the compiler may be doing sophisticated reasoning with a bunch of non-local knowledge that it's deduced from the code, neither of which are well-understood, and here we have to identify some envelope, expressive
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > +o	Do not use the results from the boolean "&&" and "||" when
> > > +	dereferencing.  For example, the following (rather improbable)
> > > +	code is buggy:
> > > +
> > > +		int a[2];
> > > +		int index;
> > > +		int force_zero_index = 1;
> > > +
> > > +		...
> > > +
> > > +		r1 = rcu_dereference(i1)
> > > +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > +
> > > +	The reason this is buggy is that "&&" and "||" are often compiled
> > > +	using branches.  While weak-memory machines such as ARM or PowerPC
> > > +	do order stores after such branches, they can speculate loads,
> > > +	which can result in misordering bugs.
> > > +
> > > +o	Do not use the results from relational operators ("==", "!=",
> > > +	">", ">=", "<", or "<=") when dereferencing.  For example,
> > > +	the following (quite strange) code is buggy:
> > > +
> > > +		int a[2];
> > > +		int index;
> > > +		int flip_index = 0;
> > > +
> > > +		...
> > > +
> > > +		r1 = rcu_dereference(i1)
> > > +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > +
> > > +	As before, the reason this is buggy is that relational operators
> > > +	are often compiled using branches.  And as before, although
> > > +	weak-memory machines such as ARM or PowerPC do order stores
> > > +	after such branches, but can speculate loads, which can again
> > > +	result in misordering bugs.
> >
> > Those two would be allowed by the wording I have recently proposed,
> > AFAICS.  r1 != flip_index would result in two possible values (unless
> > there are further constraints due to the type of r1 and the values that
> > flip_index can have).
>
> And I am OK with the value_dep_preserving type providing more/better
> guarantees than we get by default from current compilers.
>
> One question, though.  Suppose that the code did not want a value
> dependency to be tracked through a comparison operator.  What does
> the developer do in that case?  (The reason I ask is that I have not
> yet found a use case in the Linux kernel that expects a value
> dependency to be tracked through a comparison.)

Hmm. I suppose use an explicit cast to non-vdp before or after the comparison?
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Mar 02, 2014 at 11:44:52PM +, Peter Sewell wrote: > On 2 March 2014 23:20, Paul E. McKenney wrote: > > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: > >> On 1 March 2014 08:03, Paul E. McKenney wrote: > >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> >> Hi Paul, > >> >> > >> >> On 28 February 2014 18:50, Paul E. McKenney > >> >> wrote: > >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> >> > wrote: > >> >> >> > > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> >> > > where that other pointer was properly fetched using one > >> >> >> > > of the RCU primitives. Here it doesn't matter which > >> >> >> > > pointer > >> >> >> > > you use. At least as long as the rcu_assign_pointer() > >> >> >> > > for > >> >> >> > > that other pointer happened after the last update to the > >> >> >> > > pointed-to structure. > >> >> >> > > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> >> > > >> >> >> > I think that it might be worth pointing out as an example, and > >> >> >> > saying > >> >> >> > that code like > >> >> >> > > >> >> >> >p = atomic_read(consume); > >> >> >> >X; > >> >> >> >q = atomic_read(consume); > >> >> >> >Y; > >> >> >> >if (p == q) > >> >> >> > data = p->val; > >> >> >> > > >> >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> >> > *either* p or q, but you can't really tell which, since the > >> >> >> > compiler > >> >> >> > can decide that the values are interchangeable. > >> >> >> > > >> >> >> > I cannot for the life of me come up with a situation where this > >> >> >> > would > >> >> >> > matter, though. 
If "X" contains a fence, then that fence will be a > >> >> >> > stronger ordering than anything the consume through "p" would > >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the > >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> >> > ordering to the access through "p" is through p or q is kind of > >> >> >> > irrelevant. No? > >> >> >> > >> >> >> I can make a contrived litmus test for it, but you are right, the > >> >> >> only > >> >> >> time you can see it happen is when X has no barriers, in which case > >> >> >> you don't have any ordering anyway -- both the compiler and the CPU > >> >> >> can > >> >> >> reorder the loads into p and q, and the read from p->val can, as you > >> >> >> say, > >> >> >> come from either pointer. > >> >> >> > >> >> >> For whatever it is worth, here is the litmus test: > >> >> >> > >> >> >> T1: p = kmalloc(...); > >> >> >> if (p == NULL) > >> >> >> deal_with_it(); > >> >> >> p->a = 42; /* Each field in its own cache line. */ > >> >> >> p->b = 43; > >> >> >> p->c = 44; > >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> >> p->b = 143; > >> >> >> p->c = 144; > >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> >> > >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); > >> >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); > >> >> >> if (p == q) { > >> >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> >> r2 = p->c; /* Could get 44 on weakly order system. */ > >> >> >> } > >> >> >> > >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get > >> >> >> what > >> >> >> you get. > >> >> >> > >> >> >> And publishing a structure via one RCU-protected pointer, updating > >> >> >> it, > >> >> >> then publishing it via another pointer seems to me to be asking for > >> >> >> trouble anyway. 
If you really want to do something like that and > >> >> >> still > >> >> >> see consistency across all the fields in the structure, please put a > >> >> >> lock > >> >> >> in the structure and use it to guard updates and accesses to those > >> >> >> fields. > >> >> > > >> >> > And here is a patch documenting the restrictions for the current Linux > >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit > >> >> > differently than atomic_load_explicit(, memory_order_consume). > >> >> > > >> >> > Thoughts? > >> >> > >> >> That might serve as informal documentation for linux kernel > >> >> programmers about the bounds on the optimisations that you expect > >> >> compilers to do for common-case RCU code - and I guess that's what you > >> >> intend it to be for. But I don't see how one can make it precise > >> >> enough to serve as a language definition, so that compiler people > >> >> could confidently say "yes, we respect that", which I guess is what > >> >> you
Re: [RFC][PATCH 0/5] arch: atomic rework
On 2 March 2014 23:20, Paul E. McKenney wrote: > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: >> On 1 March 2014 08:03, Paul E. McKenney wrote: >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> >> Hi Paul, >> >> >> >> On 28 February 2014 18:50, Paul E. McKenney >> >> wrote: >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> >> > wrote: >> >> >> > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> >> > > where that other pointer was properly fetched using one >> >> >> > > of the RCU primitives. Here it doesn't matter which >> >> >> > > pointer >> >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> >> > > that other pointer happened after the last update to the >> >> >> > > pointed-to structure. >> >> >> > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? >> >> >> > >> >> >> > I think that it might be worth pointing out as an example, and saying >> >> >> > that code like >> >> >> > >> >> >> >p = atomic_read(consume); >> >> >> >X; >> >> >> >q = atomic_read(consume); >> >> >> >Y; >> >> >> >if (p == q) >> >> >> > data = p->val; >> >> >> > >> >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> >> > can decide that the values are interchangeable. >> >> >> > >> >> >> > I cannot for the life of me come up with a situation where this would >> >> >> > matter, though. If "X" contains a fence, then that fence will be a >> >> >> > stronger ordering than anything the consume through "p" would >> >> >> > guarantee anyway. 
And if "X" does *not* contain a fence, then the >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the >> >> >> > ordering to the access through "p" is through p or q is kind of >> >> >> > irrelevant. No? >> >> >> >> >> >> I can make a contrived litmus test for it, but you are right, the only >> >> >> time you can see it happen is when X has no barriers, in which case >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can >> >> >> reorder the loads into p and q, and the read from p->val can, as you >> >> >> say, >> >> >> come from either pointer. >> >> >> >> >> >> For whatever it is worth, here is the litmus test: >> >> >> >> >> >> T1: p = kmalloc(...); >> >> >> if (p == NULL) >> >> >> deal_with_it(); >> >> >> p->a = 42; /* Each field in its own cache line. */ >> >> >> p->b = 43; >> >> >> p->c = 44; >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); >> >> >> p->b = 143; >> >> >> p->c = 144; >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> >> >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> >> >> r1 = p->b; /* Guaranteed to get 143. */ >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); >> >> >> if (p == q) { >> >> >> /* The compiler decides that q->c is same as p->c. */ >> >> >> r2 = p->c; /* Could get 44 on weakly order system. */ >> >> >> } >> >> >> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> >> >> you get. >> >> >> >> >> >> And publishing a structure via one RCU-protected pointer, updating it, >> >> >> then publishing it via another pointer seems to me to be asking for >> >> >> trouble anyway. If you really want to do something like that and still >> >> >> see consistency across all the fields in the structure, please put a >> >> >> lock >> >> >> in the structure and use it to guard updates and accesses to those >> >> >> fields. 
>> >> > >> >> > And here is a patch documenting the restrictions for the current Linux >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> > differently than atomic_load_explicit(, memory_order_consume). >> >> > >> >> > Thoughts? >> >> >> >> That might serve as informal documentation for linux kernel >> >> programmers about the bounds on the optimisations that you expect >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> intend it to be for. But I don't see how one can make it precise >> >> enough to serve as a language definition, so that compiler people >> >> could confidently say "yes, we respect that", which I guess is what >> >> you really need. As a useful criterion, we should aim for something >> >> precise enough that in a verified-compiler context you can >> >> mathematically prove that the compiler will satisfy it (even though >> >> that won't happen anytime soon for GCC), and that analysis tool >> >> authors can actually know what they're working with. All this stuff > >> >> >> about "you
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> Hi Paul,
>
> On 28 February 2014 18:50, Paul E. McKenney wrote:
> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney wrote:
> >> > >
> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> > >     where that other pointer was properly fetched using one
> >> > >     of the RCU primitives.  Here it doesn't matter which pointer
> >> > >     you use.  At least as long as the rcu_assign_pointer() for
> >> > >     that other pointer happened after the last update to the
> >> > >     pointed-to structure.
> >> > >
> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >
> >> > I think that it might be worth pointing out as an example, and saying
> >> > that code like
> >> >
> >> >     p = atomic_read(consume);
> >> >     X;
> >> >     q = atomic_read(consume);
> >> >     Y;
> >> >     if (p == q)
> >> >             data = p->val;
> >> >
> >> > then the access of "p->val" is constrained to be data-dependent on
> >> > *either* p or q, but you can't really tell which, since the compiler
> >> > can decide that the values are interchangeable.
> >> >
> >> > I cannot for the life of me come up with a situation where this would
> >> > matter, though.  If "X" contains a fence, then that fence will be a
> >> > stronger ordering than anything the consume through "p" would
> >> > guarantee anyway.  And if "X" does *not* contain a fence, then the
> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> > ordering to the access through "p" is through p or q is kind of
> >> > irrelevant.  No?
> >>
> >> I can make a contrived litmus test for it, but you are right, the only
> >> time you can see it happen is when X has no barriers, in which case
> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> come from either pointer.
> >>
> >> For whatever it is worth, here is the litmus test:
> >>
> >> T1: p = kmalloc(...);
> >>     if (p == NULL)
> >>         deal_with_it();
> >>     p->a = 42;  /* Each field in its own cache line. */
> >>     p->b = 43;
> >>     p->c = 44;
> >>     atomic_store_explicit(&gp1, p, memory_order_release);
> >>     p->b = 143;
> >>     p->c = 144;
> >>     atomic_store_explicit(&gp2, p, memory_order_release);
> >>
> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume);
> >>     r1 = p->b;  /* Guaranteed to get 143. */
> >>     q = atomic_load_explicit(&gp1, memory_order_consume);
> >>     if (p == q) {
> >>         /* The compiler decides that q->c is same as p->c. */
> >>         r2 = p->c;  /* Could get 44 on weakly ordered system. */
> >>     }
> >>
> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> you get.
> >>
> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> then publishing it via another pointer seems to me to be asking for
> >> trouble anyway.  If you really want to do something like that and still
> >> see consistency across all the fields in the structure, please put a lock
> >> in the structure and use it to guard updates and accesses to those fields.
> >
> > And here is a patch documenting the restrictions for the current Linux
> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> > differently than atomic_load_explicit(&p, memory_order_consume).
> >
> > Thoughts?
>
> That might serve as informal documentation for linux kernel
> programmers about the bounds on the optimisations that you expect
> compilers to do for common-case RCU code - and I guess that's what you
> intend it to be for.
> But I don't see how one can make it precise
> enough to serve as a language definition, so that compiler people
> could confidently say "yes, we respect that", which I guess is what
> you really need.  As a useful criterion, we should aim for something
> precise enough that in a verified-compiler context you can
> mathematically prove that the compiler will satisfy it (even though
> that won't happen anytime soon for GCC), and that analysis tool
> authors can actually know what they're working with.  All this stuff
> about "you should avoid cancellation", and "avoid masking with just a
> small number of bits" is just too vague.

Understood, and yes, this is intended to document current compiler
behavior for the Linux kernel community.  It would not make sense to show
it to the C11 or C++11 communities, except perhaps as an informational
piece on current practice.

> The basic problem is that the compiler may be doing sophisticated
> reasoning with a bunch of non-local knowledge that it's deduced from
> the code, neither of which are well-understood, and here we have to
> identify some envelope, expressive enough for RCU idioms, in which that
> reasoning doesn't allow data/address dependencies to be removed (and
> hence the hardware guarantee about them will be maintained at the
> source level).
Re: [RFC][PATCH 0/5] arch: atomic rework
Hi Paul,

On 28 February 2014 18:50, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney wrote:
>> > >
>> > > 3.  The comparison was against another RCU-protected pointer,
>> > >     where that other pointer was properly fetched using one
>> > >     of the RCU primitives.  Here it doesn't matter which pointer
>> > >     you use.  At least as long as the rcu_assign_pointer() for
>> > >     that other pointer happened after the last update to the
>> > >     pointed-to structure.
>> > >
>> > > I am a bit nervous about #3.  Any thoughts on it?
>> >
>> > I think that it might be worth pointing out as an example, and saying
>> > that code like
>> >
>> >     p = atomic_read(consume);
>> >     X;
>> >     q = atomic_read(consume);
>> >     Y;
>> >     if (p == q)
>> >             data = p->val;
>> >
>> > then the access of "p->val" is constrained to be data-dependent on
>> > *either* p or q, but you can't really tell which, since the compiler
>> > can decide that the values are interchangeable.
>> >
>> > I cannot for the life of me come up with a situation where this would
>> > matter, though.  If "X" contains a fence, then that fence will be a
>> > stronger ordering than anything the consume through "p" would
>> > guarantee anyway.  And if "X" does *not* contain a fence, then the
>> > atomic reads of p and q are unordered *anyway*, so then whether the
>> > ordering to the access through "p" is through p or q is kind of
>> > irrelevant.  No?
>>
>> I can make a contrived litmus test for it, but you are right, the only
>> time you can see it happen is when X has no barriers, in which case
>> you don't have any ordering anyway -- both the compiler and the CPU can
>> reorder the loads into p and q, and the read from p->val can, as you say,
>> come from either pointer.
>> >> For whatever it is worth, here is the litmus test: >> >> T1: p = kmalloc(...); >> if (p == NULL) >> deal_with_it(); >> p->a = 42; /* Each field in its own cache line. */ >> p->b = 43; >> p->c = 44; >> atomic_store_explicit(&gp1, p, memory_order_release); >> p->b = 143; >> p->c = 144; >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> r1 = p->b; /* Guaranteed to get 143. */ >> q = atomic_load_explicit(&gp1, memory_order_consume); >> if (p == q) { >> /* The compiler decides that q->c is same as p->c. */ >> r2 = p->c; /* Could get 44 on weakly ordered system. */ >> } >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> you get. >> >> And publishing a structure via one RCU-protected pointer, updating it, >> then publishing it via another pointer seems to me to be asking for >> trouble anyway. If you really want to do something like that and still >> see consistency across all the fields in the structure, please put a lock >> in the structure and use it to guard updates and accesses to those fields. > > And here is a patch documenting the restrictions for the current Linux > kernel. The rules change a bit due to rcu_dereference() acting a bit > differently than atomic_load_explicit(&p, memory_order_consume). > > Thoughts? That might serve as informal documentation for Linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say "yes, we respect that", which I guess is what you really need. 
As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with. All this stuff about "you should avoid cancellation", and "avoid masking with just a small number of bits" is just too vague. The basic problem is that the compiler may be doing sophisticated reasoning with a bunch of non-local knowledge that it's deduced from the code, neither of which are well-understood, and here we have to identify some envelope, expressive enough for RCU idioms, in which that reasoning doesn't allow data/address dependencies to be removed (and hence the hardware guarantee about them will be maintained at the source level). The C11 syntactic notion of dependency, whatever its faults, was at least precise, could be reasoned about locally (just looking at the syntactic code in question), and did do that. The fact that current compilers do optimisations that remove dependencies and will likely have many bugs at present is besides the point - this was surely intended as a *new* constraint on what they are allowed to do. 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: Hi Paul, On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. 
*/ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(&p, memory_order_consume). Thoughts? That might serve as informal documentation for Linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say "yes, we respect that", which I guess is what you really need. As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with. All this stuff about "you should avoid cancellation", and "avoid masking with just a small number of bits" is just too vague. 
Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities, except perhaps as an informational piece on current practice. The basic problem is that the compiler may be doing sophisticated reasoning with a bunch of non-local knowledge that it's deduced from the code, neither of which are well-understood, and here we have to identify some envelope, expressive enough for RCU idioms, in which that reasoning doesn't allow data/address dependencies to be removed (and hence the hardware guarantee about them will be maintained at the source level). The C11 syntactic notion of dependency, whatever its faults, was at least precise, could be reasoned about locally (just looking at the syntactic code in question), and did do that.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > > wrote: > > > > > > 3. The comparison was against another RCU-protected pointer, > > > where that other pointer was properly fetched using one > > > of the RCU primitives. Here it doesn't matter which pointer > > > you use. At least as long as the rcu_assign_pointer() for > > > that other pointer happened after the last update to the > > > pointed-to structure. > > > > > > I am a bit nervous about #3. Any thoughts on it? > > > > I think that it might be worth pointing out as an example, and saying > > that code like > > > >p = atomic_read(consume); > >X; > >q = atomic_read(consume); > >Y; > >if (p == q) > > data = p->val; > > > > then the access of "p->val" is constrained to be data-dependent on > > *either* p or q, but you can't really tell which, since the compiler > > can decide that the values are interchangeable. > > > > I cannot for the life of me come up with a situation where this would > > matter, though. If "X" contains a fence, then that fence will be a > > stronger ordering than anything the consume through "p" would > > guarantee anyway. And if "X" does *not* contain a fence, then the > > atomic reads of p and q are unordered *anyway*, so then whether the > > ordering to the access through "p" is through p or q is kind of > > irrelevant. No? > > I can make a contrived litmus test for it, but you are right, the only > time you can see it happen is when X has no barriers, in which case > you don't have any ordering anyway -- both the compiler and the CPU can > reorder the loads into p and q, and the read from p->val can, as you say, > come from either pointer. > > For whatever it is worth, here is the litmus test: > > T1: p = kmalloc(...); > if (p == NULL) > deal_with_it(); > p->a = 42; /* Each field in its own cache line. 
*/ > p->b = 43; > p->c = 44; > atomic_store_explicit(&gp1, p, memory_order_release); > p->b = 143; > p->c = 144; > atomic_store_explicit(&gp2, p, memory_order_release); > > T2: p = atomic_load_explicit(&gp2, memory_order_consume); > r1 = p->b; /* Guaranteed to get 143. */ > q = atomic_load_explicit(&gp1, memory_order_consume); > if (p == q) { > /* The compiler decides that q->c is same as p->c. */ > r2 = p->c; /* Could get 44 on weakly ordered system. */ > } > > The loads from gp1 and gp2 are, as you say, unordered, so you get what > you get. > > And publishing a structure via one RCU-protected pointer, updating it, > then publishing it via another pointer seems to me to be asking for > trouble anyway. If you really want to do something like that and still > see consistency across all the fields in the structure, please put a lock > in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(&p, memory_order_consume). Thoughts? Thanx, Paul documentation: Record rcu_dereference() value mishandling Recent LKML discussions (see http://lwn.net/Articles/586838/ and http://lwn.net/Articles/588300/ for the LWN writeups) brought out some ways of misusing the return value from rcu_dereference() that are not necessarily completely intuitive. This commit therefore documents what can and cannot safely be done with these values. Signed-off-by: Paul E. McKenney diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX index fa57139f50bf..f773a264ae02 100644 --- a/Documentation/RCU/00-INDEX +++ b/Documentation/RCU/00-INDEX @@ -12,6 +12,8 @@ lockdep-splat.txt - RCU Lockdep splats explained. 
NMI-RCU.txt - Using RCU to Protect Dynamic NMI Handlers +rcu_dereference.txt + - Proper care and feeding of return values from rcu_dereference() rcubarrier.txt - RCU and Unloadable Modules rculist_nulls.txt diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 9d10d1db16a5..877947130ebe 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome! http://www.openvms.compaq.com/wizard/wiz_2637.html The rcu_dereference() primitive is also an excellent - documentation aid, letting the person reading the code - know exactly which pointers are protected by RCU. + documentation aid, letting the person reading the + code know exactly which pointers are protected by RCU.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote: > On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: > > > > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > > > wrote: > > > > > > > > Good points. How about the following replacements? > > > > > > > > 3. Adding or subtracting an integer to/from a chained pointer > > > > results in another chained pointer in that same pointer chain. > > > > The results of addition and subtraction operations that cancel > > > > the chained pointer's value (for example, "p-(long)p" where "p" > > > > is a pointer to char) are implementation defined. > > > > > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > > > applied to a chained pointer and an integer for the purposes > > > > of alignment and pointer translation results in another > > > > chained pointer in that same pointer chain. Other uses > > > > of bitwise operators on chained pointers (for example, > > > > "p|~0") are implementation defined. > > > > > > Quite frankly, I think all of this language that is about the actual > > > operations is irrelevant and wrong. > > > > > > It's not going to help compiler writers, and it sure isn't going to > > > help users that read this. > > > > > > Why not just talk about "value chains" and that any operations that > > > restrict the value range severely end up breaking the chain. There is > > > no point in listing the operations individually, because every single > > > operation *can* restrict things. Listing individual operations and > > > dependencies is just fundamentally wrong. > > > > [...] > > > > > The *only* thing that matters for all of them is whether they are > > > "value-preserving", or whether they drop so much information that the > > > compiler might decide to use a control dependency instead. That's true > > > for every single one of them. 
> > > > > > Similarly, actual true control dependencies that limit the problem > > > space sufficiently that the actual pointer value no longer has > > > significant information in it (see the above example) are also things > > > that remove information to the point that only a control dependency > > > remains. Even when the value itself is not modified in any way at all. > > > > I agree that just considering syntactic properties of the program seems > > to be insufficient. Making it instead depend on whether there is a > > "semantic" dependency due to a value being "necessary" to compute a > > result seems better. However, whether a value is "necessary" might not > > be obvious, and I understand Paul's argument that he does not want to > > have to reason about all potential compiler optimizations. Thus, I > > believe we need to specify when a value is "necessary". > > > > I have a suggestion for a somewhat different formulation of the feature > > that you seem to have in mind, which I'll discuss below. Excuse the > > verbosity of the following, but I'd rather like to avoid > > misunderstandings than save a few words. > > Thank you very much for putting this forward! I must confess that I was > stuck, and my earlier attempt now enshrined in the C11 and C++11 standards > is quite clearly way bogus. > > One possible saving grace: From discussions at the standards committee > meeting a few weeks ago, there is some chance that the committee will > be willing to do a rip-and-replace on the current memory_order_consume > wording, without provisions for backwards compatibility with the current > bogosity. 
> > > What we'd like to capture is that a value originating from a mo_consume > > load is "necessary" for a computation (e.g., it "cannot" be replaced > > with value predictions and/or control dependencies); if that's the case > > in the program, we can reasonably assume that a compiler implementation > > will transform this into a data dependency, which will then lead to > > ordering guarantees by the HW. > > > > However, we need to specify when a value is "necessary". We could say > > that this is implementation-defined, and use a set of litmus tests > > (e.g., like those discussed in the thread) to roughly carve out what a > > programmer could expect. This may even be practical for a project like > > the Linux kernel that follows strict project-internal rules and pays a > > lot of attention to what the particular implementations of compilers > > expected to compile the kernel are doing. However, I think this > > approach would be too vague for the standard and for many other > > programs/projects. > > I agree that a number of other projects would have more need for this than > might the kernel. Please understand that this is in no way denigrating > the intelligence of other projects' members. It is just that many of > them have only recently started
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > wrote: > > > > 3. The comparison was against another RCU-protected pointer, > > where that other pointer was properly fetched using one > > of the RCU primitives. Here it doesn't matter which pointer > > you use. At least as long as the rcu_assign_pointer() for > > that other pointer happened after the last update to the > > pointed-to structure. > > > > I am a bit nervous about #3. Any thoughts on it? > > I think that it might be worth pointing out as an example, and saying > that code like > >p = atomic_read(consume); >X; >q = atomic_read(consume); >Y; >if (p == q) > data = p->val; > > then the access of "p->val" is constrained to be data-dependent on > *either* p or q, but you can't really tell which, since the compiler > can decide that the values are interchangeable. > > I cannot for the life of me come up with a situation where this would > matter, though. If "X" contains a fence, then that fence will be a > stronger ordering than anything the consume through "p" would > guarantee anyway. And if "X" does *not* contain a fence, then the > atomic reads of p and q are unordered *anyway*, so then whether the > ordering to the access through "p" is through p or q is kind of > irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. 
*/ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney wrote: > > 3. The comparison was against another RCU-protected pointer, > where that other pointer was properly fetched using one > of the RCU primitives. Here it doesn't matter which pointer > you use. At least as long as the rcu_assign_pointer() for > that other pointer happened after the last update to the > pointed-to structure. > > I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of "p->val" is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If "X" contains a fence, then that fence will be a stronger ordering than anything the consume through "p" would guarantee anyway. And if "X" does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through "p" is through p or q is kind of irrelevant. No? Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel wrote: > > > > I agree that just considering syntactic properties of the program seems > > to be insufficient. Making it instead depend on whether there is a > > "semantic" dependency due to a value being "necessary" to compute a > > result seems better. However, whether a value is "necessary" might not > > be obvious, and I understand Paul's argument that he does not want to > > have to reason about all potential compiler optimizations. Thus, I > > believe we need to specify when a value is "necessary". > > I suspect it's hard to really strictly define, but at the same time I > actually think that compiler writers (and users, for that matter) have > little problem understanding the concept and intent. > > I do think that listing operations might be useful to give good > examples of what is a "necessary" value, and - perhaps more > importantly - what can break the value from being "necessary". > Especially the gotchas. > > > I have a suggestion for a somewhat different formulation of the feature > > that you seem to have in mind, which I'll discuss below. Excuse the > > verbosity of the following, but I'd rather like to avoid > > misunderstandings than save a few words. > > Ok, I'm going to cut most of the verbiage since it's long and I'm not > commenting on most of it. 
> > But > > > Based on these thoughts, we could specify the new mo_consume guarantees > > roughly as follows: > > > > An evaluation E (in an execution) has a value dependency to an > > atomic and mo_consume load L (in an execution) iff: > > * L's type holds more than one value (ruling out constants > > etc.), > > * L is sequenced-before E, > > * L's result is used by the abstract machine to compute E, > > * E is value-dependency-preserving code (defined below), and > > * at the time of execution of E, L can possibly have returned at > > least two different values under the assumption that L itself > > could have returned any value allowed by L's type. > > > > If a memory access A's targeted memory location has a value > > dependency on a mo_consume load L, and an action X > > inter-thread-happens-before L, then X happens-before A. > > I think this mostly works. > > > Regarding the latter, we make a fresh start at each mo_consume load (ie, > > we assume we know nothing -- L could have returned any possible value); > > I believe this is easier to reason about than other scopes like function > > granularities (what happens on inlining?), or translation units. It > > should also be simple to implement for compilers, and would hopefully > > not constrain optimization too much. > > > > [...] > > > > Paul's litmus test would work, because we guarantee to the programmer > > that it can assume that the mo_consume load would return any value > > allowed by the type; effectively, this forbids the compiler analysis > > Paul thought about: > > So realistically, since with the new wording we can ignore the silly > cases (ie "p-p") and we can ignore the trivial-to-optimize compiler > cases ("if (p == &variable) .. use p"), and you would forbid the > "global value range optimization case" that Paul brought up, what > remains would seem to be just really subtle compiler transformations > of data dependencies to control dependencies.
FWIW, I am looking through the kernel for instances of your first "if (p == &variable) .. use p" litmus test. All the ones I have found thus far are OK for one of the following reasons:

1. The comparison was against NULL, so you don't get to dereference the pointer anyway. About 80% are in this category.

2. The comparison was against another pointer, but there were no dereferences afterwards. Here is an example of what these can look like: list_for_each_entry_rcu(p, , next) if (p == ) return; /* "p" goes out of scope. */

3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use, at least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure.

I am a bit nervous about #3. Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4. The pointed-to data is constant: (a) it was initialized at boot time, (b) the update-side lock is held, (c) we are running in a kthread and the data was initialized before the kthread was created, or (d) we are running in a module, and the data was initialized during or before module-init time for that module. And many more besides, involving pretty much every
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > > wrote: > > > > > > Good points. How about the following replacements? > > > > > > 3. Adding or subtracting an integer to/from a chained pointer > > > results in another chained pointer in that same pointer chain. > > > The results of addition and subtraction operations that cancel > > > the chained pointer's value (for example, "p-(long)p" where "p" > > > is a pointer to char) are implementation defined. > > > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > > applied to a chained pointer and an integer for the purposes > > > of alignment and pointer translation results in another > > > chained pointer in that same pointer chain. Other uses > > > of bitwise operators on chained pointers (for example, > > > "p|~0") are implementation defined. > > Quite frankly, I think all of this language that is about the actual > > operations is irrelevant and wrong. > > > > It's not going to help compiler writers, and it sure isn't going to > > help users that read this. > > > > Why not just talk about "value chains" and that any operations that > > restrict the value range severely end up breaking the chain. There is > > no point in listing the operations individually, because every single > > operation *can* restrict things. Listing individual operations and > > dependencies is just fundamentally wrong. > > [...] > > > The *only* thing that matters for all of them is whether they are > > "value-preserving", or whether they drop so much information that the > > compiler might decide to use a control dependency instead. That's true > > for every single one of them. 
> > > > Similarly, actual true control dependencies that limit the problem > > space sufficiently that the actual pointer value no longer has > > significant information in it (see the above example) are also things > > that remove information to the point that only a control dependency > > remains. Even when the value itself is not modified in any way at all. > > I agree that just considering syntactic properties of the program seems > to be insufficient. Making it instead depend on whether there is a > "semantic" dependency due to a value being "necessary" to compute a > result seems better. However, whether a value is "necessary" might not > be obvious, and I understand Paul's argument that he does not want to > have to reason about all potential compiler optimizations. Thus, I > believe we need to specify when a value is "necessary". > > I have a suggestion for a somewhat different formulation of the feature > that you seem to have in mind, which I'll discuss below. Excuse the > verbosity of the following, but I'd rather like to avoid > misunderstandings than save a few words. Thank you very much for putting this forward! I must confess that I was stuck, and my earlier attempt now enshrined in the C11 and C++11 standards is quite clearly way bogus. One possible saving grace: From discussions at the standards committee meeting a few weeks ago, there is some chance that the committee will be willing to do a rip-and-replace on the current memory_order_consume wording, without provisions for backwards compatibility with the current bogosity. > What we'd like to capture is that a value originating from a mo_consume > load is "necessary" for a computation (e.g., it "cannot" be replaced > with value predictions and/or control dependencies); if that's the case > in the program, we can reasonably assume that a compiler implementation > will transform this into a data dependency, which will then lead to > ordering guarantees by the HW. 
> > However, we need to specify when a value is "necessary". We could say > that this is implementation-defined, and use a set of litmus tests > (e.g., like those discussed in the thread) to roughly carve out what a > programmer could expect. This may even be practical for a project like > the Linux kernel that follows strict project-internal rules and pays a > lot of attention to what the particular implementations of compilers > expected to compile the kernel are doing. However, I think this > approach would be too vague for the standard and for many other > programs/projects. I agree that a number of other projects would have more need for this than might the kernel. Please understand that this is in no way denigrating the intelligence of other projects' members. It is just that many of them have only recently started seriously thinking about concurrency. In contrast, the Linux kernel community has been doing concurrency since the mid-1990s. Projects with less experience with concurrency will probably need more help, from the compiler and from elsewhere as
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel wrote: > > I agree that just considering syntactic properties of the program seems > to be insufficient. Making it instead depend on whether there is a > "semantic" dependency due to a value being "necessary" to compute a > result seems better. However, whether a value is "necessary" might not > be obvious, and I understand Paul's argument that he does not want to > have to reason about all potential compiler optimizations. Thus, I > believe we need to specify when a value is "necessary". I suspect it's hard to really strictly define, but at the same time I actually think that compiler writers (and users, for that matter) have little problem understanding the concept and intent. I do think that listing operations might be useful to give good examples of what is a "necessary" value, and - perhaps more importantly - what can break the value from being "necessary". Especially the gotchas. > I have a suggestion for a somewhat different formulation of the feature > that you seem to have in mind, which I'll discuss below. Excuse the > verbosity of the following, but I'd rather like to avoid > misunderstandings than save a few words. Ok, I'm going to cut most of the verbiage since it's long and I'm not commenting on most of it. But > Based on these thoughts, we could specify the new mo_consume guarantees > roughly as follows: > > An evaluation E (in an execution) has a value dependency to an > atomic and mo_consume load L (in an execution) iff: > * L's type holds more than one value (ruling out constants > etc.), > * L is sequenced-before E, > * L's result is used by the abstract machine to compute E, > * E is value-dependency-preserving code (defined below), and > * at the time of execution of E, L can possibly have returned at > least two different values under the assumption that L itself > could have returned any value allowed by L's type. 
> > If a memory access A's targeted memory location has a value > dependency on a mo_consume load L, and an action X > inter-thread-happens-before L, then X happens-before A. I think this mostly works. > Regarding the latter, we make a fresh start at each mo_consume load (ie, > we assume we know nothing -- L could have returned any possible value); > I believe this is easier to reason about than other scopes like function > granularities (what happens on inlining?), or translation units. It > should also be simple to implement for compilers, and would hopefully > not constrain optimization too much. > > [...] > > Paul's litmus test would work, because we guarantee to the programmer > that it can assume that the mo_consume load would return any value > allowed by the type; effectively, this forbids the compiler analysis > Paul thought about: So realistically, since with the new wording we can ignore the silly cases (ie "p-p") and we can ignore the trivial-to-optimize compiler cases ("if (p == &variable) .. use p"), and you would forbid the "global value range optimization case" that Paul brought up, what remains would seem to be just really subtle compiler transformations of data dependencies to control dependencies. And the only such thing I can think of is basically compiler-initiated value-prediction, presumably directed by PGO (since now if the value prediction is in the source code, it's considered to break the value chain). The good thing is that, afaik, value-prediction is largely not used in real life. There are lots of papers on it, but I don't think anybody actually does it (although I can easily see some specint-specific optimization pattern that is built up around it). And even value prediction is actually fine, as long as the compiler can see the memory *source* of the value prediction (and it isn't a mo_consume). So it really ends up limiting your value prediction in very simple ways: you cannot do it to function arguments if they are registers. 
But you can still do value prediction on values you loaded from memory, if you can actually *see* that memory op. Of course, on more strongly ordered CPUs, even that "register argument" limitation goes away. So I agree that there is basically no real optimization constraint. Value-prediction is of dubious value to begin with, and the actual constraint on its use, if some compiler writer really wants to do it, is not onerous. > What I have in mind is roughly the following (totally made-up syntax -- > suggestions for how to do this properly are very welcome): > * Have a type modifier (eg, like restrict), that specifies that > operations on data of this type are preserving value dependencies: So I'm not violently opposed, but I think the upsides are not great. Note that my earlier suggestion to use "restrict" wasn't because I believed the annotation itself would be visible, but basically just as a legalistic promise to the compiler that *if* it found an alias, then it didn't need to worry about
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > wrote: > > > > Good points. How about the following replacements? > > > > 3. Adding or subtracting an integer to/from a chained pointer > > results in another chained pointer in that same pointer chain. > > The results of addition and subtraction operations that cancel > > the chained pointer's value (for example, "p-(long)p" where "p" > > is a pointer to char) are implementation defined. > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > applied to a chained pointer and an integer for the purposes > > of alignment and pointer translation results in another > > chained pointer in that same pointer chain. Other uses > > of bitwise operators on chained pointers (for example, > > "p|~0") are implementation defined. > > Quite frankly, I think all of this language that is about the actual > operations is irrelevant and wrong. > > It's not going to help compiler writers, and it sure isn't going to > help users that read this. > > Why not just talk about "value chains" and that any operations that > restrict the value range severely end up breaking the chain. There is > no point in listing the operations individually, because every single > operation *can* restrict things. Listing individual operations and > depdendencies is just fundamentally wrong. [...] > The *only* thing that matters for all of them is whether they are > "value-preserving", or whether they drop so much information that the > compiler might decide to use a control dependency instead. That's true > for every single one of them. > > Similarly, actual true control dependencies that limit the problem > space sufficiently that the actual pointer value no longer has > significant information in it (see the above example) are also things > that remove information to the point that only a control dependency > remains. Even when the value itself is not modified in any way at all. 
I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a "semantic" dependency due to a value being "necessary" to compute a result seems better. However, whether a value is "necessary" might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is "necessary". I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. What we'd like to capture is that a value originating from a mo_consume load is "necessary" for a computation (e.g., it "cannot" be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is "necessary". We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect. This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. One way to understand "necessary" would be to say that if a mo_consume load can result in more than V different values, then the actual value is "unknown", and thus "necessary" to compute anything based on it. (But this is flawed, as discussed below.) However, how big should V be? 
If it's larger than 1, atomic bool cannot be used with mo_consume, which seems weird. If V is 1, then Linus' litmus tests work (but Paul's doesn't; see below), but the compiler must not try to predict more than one value. This holds for any choice of V, so there always is an *additional* constraint on code generation for operations that are meant to take part in such "value dependencies". The bigger V might be, the less likely it should be for this to actually constrain a particular compiler's optimizations (e.g., while it might be meaningful to use value prediction for two or three values, it's probably not for 1000s). Nonetheless, if we don't want to predict the future, we need to specify V. Given that we always have some constraint for code generation anyway, and given that V > 1 might be an arbitrary-looking constraint and disallows use on atomic bool, I believe V should be 1. Furthermore, there is a problem in saying "a load can result in more than one value" because in a deterministic
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Good points. How about the following replacements? 3. Adding or subtracting an integer to/from a chained pointer results in another chained pointer in that same pointer chain. The results of addition and subtraction operations that cancel the chained pointer's value (for example, p-(long)p where p is a pointer to char) are implementation defined. 4. Bitwise operators (, |, ^, and I suppose also ~) applied to a chained pointer and an integer for the purposes of alignment and pointer translation results in another chained pointer in that same pointer chain. Other uses of bitwise operators on chained pointers (for example, p|~0) are implementation defined. Quite frankly, I think all of this language that is about the actual operations is irrelevant and wrong. It's not going to help compiler writers, and it sure isn't going to help users that read this. Why not just talk about value chains and that any operations that restrict the value range severely end up breaking the chain. There is no point in listing the operations individually, because every single operation *can* restrict things. Listing individual operations and depdendencies is just fundamentally wrong. [...] The *only* thing that matters for all of them is whether they are value-preserving, or whether they drop so much information that the compiler might decide to use a control dependency instead. That's true for every single one of them. Similarly, actual true control dependencies that limit the problem space sufficiently that the actual pointer value no longer has significant information in it (see the above example) are also things that remove information to the point that only a control dependency remains. Even when the value itself is not modified in any way at all. I agree that just considering syntactic properties of the program seems to be insufficient. 
Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. What we'd like to capture is that a value originating from a mo_consume load is necessary for a computation (e.g., it cannot be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is necessary. We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect. This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. One way to understand necessary would be to say that if a mo_consume load can result in more than V different values, then the actual value is unknown, and thus necessary to compute anything based on it. (But this is flawed, as discussed below.) However, how big should V be? If it's larger than 1, atomic bool cannot be used with mo_consume, which seems weird. 
If V is 1, then Linus' litmus tests work (but Paul's doesn't; see below), but the compiler must not try to predict more than one value. This holds for any choice of V, so there always is an *additional* constraint on code generation for operations that are meant to take part in such value dependencies. The bigger V might be, the less likely it should be for this to actually constrain a particular compiler's optimizations (e.g., while it might be meaningful to use value prediction for two or three values, it's probably not for 1000s). Nonetheless, if we don't want to predict the future, we need to specify V. Given that we always have some constraint for code generation anyway, and given that V 1 might be an arbitrary-looking constraint and disallows use on atomic bool, I believe V should be 1. Furthermore, there is a problem in saying a load can result in more than one value because in a deterministic
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel trie...@redhat.com wrote: I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I suspect it's hard to really strictly define, but at the same time I actually think that compiler writers (and users, for that matter) have little problem understanding the concept and intent. I do think that listing operations might be useful to give good examples of what is a necessary value, and - perhaps more importantly - what can break the value from being necessary. Especially the gotchas. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. Ok, I'm going to cut most of the verbiage since it's long and I'm not commenting on most of it. But Based on these thoughts, we could specify the new mo_consume guarantees roughly as follows: An evaluation E (in an execution) has a value dependency to an atomic and mo_consume load L (in an execution) iff: * L's type holds more than one value (ruling out constants etc.), * L is sequenced-before E, * L's result is used by the abstract machine to compute E, * E is value-dependency-preserving code (defined below), and * at the time of execution of E, L can possibly have returned at least two different values under the assumption that L itself could have returned any value allowed by L's type. 
If a memory access A's targeted memory location has a value dependency on a mo_consume load L, and an action X inter-thread-happens-before L, then X happens-before A. I think this mostly works. Regarding the latter, we make a fresh start at each mo_consume load (ie, we assume we know nothing -- L could have returned any possible value); I believe this is easier to reason about than other scopes like function granularities (what happens on inlining?), or translation units. It should also be simple to implement for compilers, and would hopefully not constrain optimization too much. [...] Paul's litmus test would work, because we guarantee to the programmer that it can assume that the mo_consume load would return any value allowed by the type; effectively, this forbids the compiler analysis Paul thought about: So realistically, since with the new wording we can ignore the silly cases (ie p-p) and we can ignore the trivial-to-optimize compiler cases (if (p == variable) .. use p), and you would forbid the global value range optimization case that Paul bright up, what remains would seem to be just really subtle compiler transformations of data dependencies to control dependencies. And the only such thing I can think of is basically compiler-initiated value-prediction, presumably directed by PGO (since now if the value prediction is in the source code, it's considered to break the value chain). The good thing is that afaik, value-prediction is largely not used in real life, afaik. There are lots of papers on it, but I don't think anybody actually does it (although I can easily see some specint-specific optimization pattern that is build up around it). And even value prediction is actually fine, as long as the compiler can see the memory *source* of the value prediction (and it isn't a mo_consume). So it really ends up limiting your value prediction in very simple ways: you cannot do it to function arguments if they are registers. 
But you can still do value prediction on values you loaded from memory, if you can actually *see* that memory op. Of course, on more strongly ordered CPU's, even that register argument limitation goes away. So I agree that there is basically no real optimization constraint. Value-prediction is of dubious value to begin with, and the actual constraint on its use if some compiler writer really wants to is not onerous. What I have in mind is roughly the following (totally made-up syntax -- suggestions for how to do this properly are very welcome): * Have a type modifier (eg, like restrict), that specifies that operations on data of this type are preserving value dependencies: So I'm not violently opposed, but I think the upsides are not great. Note that my earlier suggestion to use restrict wasn't because I believed the annotation itself would be visible, but basically just as a legalistic promise to the compiler that *if* it found an alias, then it didn't need to worry about
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Good points. How about the following replacements?

3. Adding or subtracting an integer to/from a chained pointer results in another chained pointer in that same pointer chain. The results of addition and subtraction operations that cancel the chained pointer's value (for example, p-(long)p where p is a pointer to char) are implementation defined.

4. Bitwise operators (&, |, ^, and I suppose also ~) applied to a chained pointer and an integer for the purposes of alignment and pointer translation results in another chained pointer in that same pointer chain. Other uses of bitwise operators on chained pointers (for example, p|~0) are implementation defined.

Quite frankly, I think all of this language that is about the actual operations is irrelevant and wrong. It's not going to help compiler writers, and it sure isn't going to help users that read this. Why not just talk about value chains and that any operations that restrict the value range severely end up breaking the chain. There is no point in listing the operations individually, because every single operation *can* restrict things. Listing individual operations and dependencies is just fundamentally wrong.

[...]

The *only* thing that matters for all of them is whether they are value-preserving, or whether they drop so much information that the compiler might decide to use a control dependency instead. That's true for every single one of them. Similarly, actual true control dependencies that limit the problem space sufficiently that the actual pointer value no longer has significant information in it (see the above example) are also things that remove information to the point that only a control dependency remains.
Even when the value itself is not modified in any way at all. I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words.

Thank you very much for putting this forward! I must confess that I was stuck, and my earlier attempt now enshrined in the C11 and C++11 standards is quite clearly way bogus. One possible saving grace: From discussions at the standards committee meeting a few weeks ago, there is some chance that the committee will be willing to do a rip-and-replace on the current memory_order_consume wording, without provisions for backwards compatibility with the current bogosity.

What we'd like to capture is that a value originating from a mo_consume load is necessary for a computation (e.g., it cannot be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is necessary. We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect.
This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. I agree that a number of other projects would have more need for this than might the kernel. Please understand that this is in no way denigrating the intelligence of other projects' members. It is just that many of them have only recently started seriously thinking about concurrency. In contrast, the Linux kernel community has been doing concurrency since the mid-1990s. Projects with less experience with concurrency will probably need more help, from the compiler and from elsewhere as well. Your proposal looks quite promising at first glance. But rather than try and comment on it immediately, I am going to take a number
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel trie...@redhat.com wrote: I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary.

I suspect it's hard to really strictly define, but at the same time I actually think that compiler writers (and users, for that matter) have little problem understanding the concept and intent. I do think that listing operations might be useful to give good examples of what is a necessary value, and - perhaps more importantly - what can break the value from being necessary. Especially the gotchas.

I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. Ok, I'm going to cut most of the verbiage since it's long and I'm not commenting on most of it. But

Based on these thoughts, we could specify the new mo_consume guarantees roughly as follows:

An evaluation E (in an execution) has a value dependency to an atomic and mo_consume load L (in an execution) iff:

* L's type holds more than one value (ruling out constants etc.),
* L is sequenced-before E,
* L's result is used by the abstract machine to compute E,
* E is value-dependency-preserving code (defined below), and
* at the time of execution of E, L can possibly have returned at least two different values under the assumption that L itself could have returned any value allowed by L's type.
If a memory access A's targeted memory location has a value dependency on a mo_consume load L, and an action X inter-thread-happens-before L, then X happens-before A. I think this mostly works. Regarding the latter, we make a fresh start at each mo_consume load (ie, we assume we know nothing -- L could have returned any possible value); I believe this is easier to reason about than other scopes like function granularities (what happens on inlining?), or translation units. It should also be simple to implement for compilers, and would hopefully not constrain optimization too much.

[...]

Paul's litmus test would work, because we guarantee to the programmer that it can assume that the mo_consume load would return any value allowed by the type; effectively, this forbids the compiler analysis Paul thought about: So realistically, since with the new wording we can ignore the silly cases (ie p-p) and we can ignore the trivial-to-optimize compiler cases (if (p == variable) .. use p), and you would forbid the global value range optimization case that Paul brought up, what remains would seem to be just really subtle compiler transformations of data dependencies to control dependencies.

FWIW, I am looking through the kernel for instances of your first if (p == variable) .. use p litmus test. All the ones I have found thus far are OK for one of the following reasons:

1. The comparison was against NULL, so you don't get to dereference the pointer anyway. About 80% are in this category.

2. The comparison was against another pointer, but there were no dereferences afterwards. Here is an example of what these can look like:

	list_for_each_entry_rcu(p, head, next)
		if (p == variable)
			return; /* p goes out of scope. */

3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use.
At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4. The pointed-to data is constant: (a) It was initialized at boot time, (b) the update-side lock is held, (c) we are running in a kthread and the data was initialized before the kthread was created, (d) we are running in a module, and the data was initialized during or before module-init time for that module. And many more besides, involving pretty much every kernel primitive that makes something run later.

5. All subsequent dereferences
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Good points. How about the following replacements?

3. Adding or subtracting an integer to/from a chained pointer results in another chained pointer in that same pointer chain. The results of addition and subtraction operations that cancel the chained pointer's value (for example, p-(long)p where p is a pointer to char) are implementation defined.

4. Bitwise operators (&, |, ^, and I suppose also ~) applied to a chained pointer and an integer for the purposes of alignment and pointer translation results in another chained pointer in that same pointer chain. Other uses of bitwise operators on chained pointers (for example, p|~0) are implementation defined.

Quite frankly, I think all of this language that is about the actual operations is irrelevant and wrong. It's not going to help compiler writers, and it sure isn't going to help users that read this. Why not just talk about value chains and that any operations that restrict the value range severely end up breaking the chain. There is no point in listing the operations individually, because every single operation *can* restrict things. Listing individual operations and dependencies is just fundamentally wrong.

[...]

The *only* thing that matters for all of them is whether they are value-preserving, or whether they drop so much information that the compiler might decide to use a control dependency instead. That's true for every single one of them.
Similarly, actual true control dependencies that limit the problem space sufficiently that the actual pointer value no longer has significant information in it (see the above example) are also things that remove information to the point that only a control dependency remains. Even when the value itself is not modified in any way at all. I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words.

Thank you very much for putting this forward! I must confess that I was stuck, and my earlier attempt now enshrined in the C11 and C++11 standards is quite clearly way bogus. One possible saving grace: From discussions at the standards committee meeting a few weeks ago, there is some chance that the committee will be willing to do a rip-and-replace on the current memory_order_consume wording, without provisions for backwards compatibility with the current bogosity.

What we'd like to capture is that a value originating from a mo_consume load is necessary for a computation (e.g., it cannot be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is necessary.
We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect. This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. I agree that a number of other projects would have more need for this than might the kernel. Please understand that this is in no way denigrating the intelligence of other projects' members. It is just that many of them have only recently started seriously thinking about concurrency. In contrast, the Linux kernel community has been doing concurrency since the mid-1990s. Projects with less experience with concurrency will probably need more help, from the compiler
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it?

I think that it might be worth pointing out as an example, and saying that code like

	p = atomic_read(consume);
	X;
	q = atomic_read(consume);
	Y;
	if (p == q)
		data = p->val;

then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No?

Linus

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it?

I think that it might be worth pointing out as an example, and saying that code like

	p = atomic_read(consume);
	X;
	q = atomic_read(consume);
	Y;
	if (p == q)
		data = p->val;

then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test:

	T1:	p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		p->a = 42;  /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		atomic_store_explicit(gp1, p, memory_order_release);
		p->b = 143;
		p->c = 144;
		atomic_store_explicit(gp2, p, memory_order_release);

	T2:	p = atomic_load_explicit(gp2, memory_order_consume);
		r1 = p->b;  /* Guaranteed to get 143. */
		q = atomic_load_explicit(gp1, memory_order_consume);
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c;  /* Could get 44 on weakly ordered system. */
		}

The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields.

							Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-26 at 18:43 +, Joseph S. Myers wrote: > On Wed, 26 Feb 2014, Torvald Riegel wrote: > > > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > > > > This needs to be as follows: > > > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > > { > > > > return i - i; > > > > } > > > > > > > > Otherwise dependencies won't get carried through it. > > > > > > C11 doesn't have attributes at all (and no specification regarding calls > > > and dependencies that I can see). And the way I read the C++11 > > > specification of carries_dependency is that specifying carries_dependency > > > is purely about increasing optimization of the caller: that if it isn't > > > specified, then the caller doesn't know what dependencies might be > > > carried. "Note: The carries_dependency attribute does not change the > > > meaning of the program, but may result in generation of more efficient > > > code. - end note". > > > > I think that this last sentence can be kind of misleading, especially > > when looking at it from an implementation point of view. How > > dependencies are handled (ie, preserving the syntactic dependencies vs. > > emitting barriers) must be part of the ABI, or things like > > [[carries_dependency]] won't work as expected (or lead to inefficient > > code). Thus, in practice, all compiler vendors on a platform would have > > to agree to a particular handling, which might end up in selecting the > > easy-but-conservative implementation option (ie, always emitting > > mo_acquire when the source uses mo_consume). > > Regardless of the ABI, my point is that if a program is valid, it is also > valid when all uses of [[carries_dependency]] are removed. If a function > doesn't use [[carries_dependency]], that means "dependencies may or may > not be carried through this function". 
If a function uses > [[carries_dependency]], that means that certain dependencies *are* carried > through the function (and the ABI should then specify what this means the > caller can rely on, in terms of the architecture's memory model). (This > may or may not be useful, but it's how I understand C++11.) I agree. What I tried to point out is that it's not the case that an *implementation* can just ignore [[carries_dependency]]. So from an implementation perspective, the attribute does have semantics.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 26 Feb 2014, Torvald Riegel wrote: > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > > This needs to be as follows: > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > { > > > return i - i; > > > } > > > > > > Otherwise dependencies won't get carried through it. > > > > C11 doesn't have attributes at all (and no specification regarding calls > > and dependencies that I can see). And the way I read the C++11 > > specification of carries_dependency is that specifying carries_dependency > > is purely about increasing optimization of the caller: that if it isn't > > specified, then the caller doesn't know what dependencies might be > > carried. "Note: The carries_dependency attribute does not change the > > meaning of the program, but may result in generation of more efficient > > code. - end note". > > I think that this last sentence can be kind of misleading, especially > when looking at it from an implementation point of view. How > dependencies are handled (ie, preserving the syntactic dependencies vs. > emitting barriers) must be part of the ABI, or things like > [[carries_dependency]] won't work as expected (or lead to inefficient > code). Thus, in practice, all compiler vendors on a platform would have > to agree to a particular handling, which might end up in selecting the > easy-but-conservative implementation option (ie, always emitting > mo_acquire when the source uses mo_consume). Regardless of the ABI, my point is that if a program is valid, it is also valid when all uses of [[carries_dependency]] are removed. If a function doesn't use [[carries_dependency]], that means "dependencies may or may not be carried through this function". 
If a function uses [[carries_dependency]], that means that certain dependencies *are* carried through the function (and the ABI should then specify what this means the caller can rely on, in terms of the architecture's memory model). (This may or may not be useful, but it's how I understand C++11.) -- Joseph S. Myers jos...@codesourcery.com
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote: > On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: > > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > > > Hi, > > > > > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > > > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > > > dependency-generation part, and if I was a gcc person, seeing that > > > > bugzilla entry Torvald pointed at, I would personally want to > > > > dismember somebody with a rusty spoon.. > > > > > > Yes. Defining dependency chains in the way the standard currently seems > > > to do must come from people not writing compilers. There's simply no > > > sensible way to implement it without being really conservative, because > > > the depchains can contain arbitrary constructs including stores, > > > loads and function calls but must still be observed. > > > > > > And with conservative I mean "everything is a source of a dependency, and > > > hence can't be removed, reordered or otherwise fiddled with", and that > > > includes code sequences where no atomic objects are anywhere in sight [1]. > > > In the light of that the only realistic way (meaning to not have to > > > disable optimization everywhere) to implement consume as currently > > > specified is to map it to acquire. At which point it becomes pointless. > > > > No, only memory_order_consume loads and [[carries_dependency]] > > function arguments are sources of dependency chains. > > However, that is, given how the standard specifies things, just one of > the possible ways for how an implementation can handle this. Treating > [[carries_dependency]] as a "necessary" annotation to make exploiting > mo_consume work in practice is possible, but it's not required by the > standard. 
> > Also, dependencies are specified to flow through loads and stores > (restricted to scalar objects and bitfields), so any load that might > load from a dependency-carrying store can also be a source (and that > doesn't seem to be restricted by [[carries_dependency]]). OK, this last is clearly totally unacceptable. :-/ Leaving aside the option of dropping the whole thing for the moment, the only thing that suggests itself is having all dependencies die at a specific point in the code, corresponding to the rcu_read_unlock(). But as far as I can see, that absolutely requires "necessary" parameter and return marking in order to correctly handle nested RCU read-side critical sections in different functions.

							Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote: > On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote: > > Hi, > > > > On Mon, 24 Feb 2014, Linus Torvalds wrote: > > > > > > To me that reads like > > > > > > > > int i; > > > > int *q = &i; > > > > int **p = &q; > > > > > > > > atomic_XXX (p, CONSUME); > > > > > > > > orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they > > > > want to say that it orders against aliased storage - but then go further > > > > and include "indirectly through a chain of pointers"?! Thus an > > > > atomic read of a int * orders against any 'int' memory operation but > > > > not against 'float' memory operations? > > > > > > No, it's not about type at all, and the "chain of pointers" can be > > > much more complex than that, since the "int *" can point to within an > > > object that contains other things than just that "int" (the "int" can > > > be part of a structure that then has pointers to other structures > > > etc). > > > > So, let me try to poke holes into your definition or increase my > > understanding :) . You said "chain of pointers"(dereferences I assume), > > e.g. if p is result of consume load, then access to > > p->here->there->next->prev->stuff is supposed to be ordered with that load > > (or only when that last load/store itself is also an atomic load or > > store?). > > > > So, what happens if the pointer deref chain is partly hidden in some > > functions: > > > > A * adjustptr (B *ptr) { return &ptr->here->there->next; } > > B * p = atomic_XXX (, consume); > > adjustptr(p)->prev->stuff = bla; > > > > As far as I understood you, this whole ptrderef chain business would be > > only an optimization opportunity, right? So if the compiler can't be sure > > how p is actually used (as in my function-using case, assume adjustptr is > > defined in another unit), then the consume load would simply be > > transformed into an acquire (or whatever, with some barrier I mean)? 
Only > > _if_ the compiler sees all obvious uses of p (indirectly through pointer > > derefs) can it, yeah, do what with the consume load? > > Good point, I left that out of my list. Adding it: > > 13. By default, pointer chains do not propagate into or out of functions. > In implementations having attributes, a [[carries_dependency]] > may be used to mark a function argument or return as passing > a pointer chain into or out of that function. > > If a function does not contain memory_order_consume loads and > also does not contain [[carries_dependency]] attributes, then > that function may be compiled using any desired dependency-breaking > optimizations. > > The ordering effects are implementation defined when a given > pointer chain passes into or out of a function through a parameter > or return not marked with a [[carries_dependency]] attribute. > > Note that this last paragraph differs from the current standard, which > would require ordering regardless. I would prefer if we could get rid of [[carries_dependency]] as well; currently, it's a hint whose effectiveness really depends on how the particular implementation handles this attribute. If we still need something like it in the future, it would be good if it had a clearer use and performance effects. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-24 at 09:38 -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz wrote: > > > > So, let me try to poke holes into your definition or increase my > > understanding :) . You said "chain of pointers"(dereferences I assume), > > e.g. if p is result of consume load, then access to > > p->here->there->next->prev->stuff is supposed to be ordered with that load > > (or only when that last load/store itself is also an atomic load or > > store?). > > It's supposed to be ordered wrt the first load (the consuming one), yes. > > > So, what happens if the pointer deref chain is partly hidden in some > > functions: > > No problem. > > The thing is, the ordering is actually handled by the CPU in all > relevant cases. So the compiler doesn't actually need to *do* > anything. All this legalistic stuff is just to describe the semantics > and the guarantees. > > The problem is two cases: > > (a) alpha (which doesn't really order any accesses at all, not even > dependent loads), but for a compiler alpha is actually trivial: just > add a "rmb" instruction after the load, and you can't really do > anything else (there's a few optimizations you can do wrt the rmb, but > they are very specific and simple). > > So (a) is a "problem", but the solution is actually really simple, and > gives very *strong* guarantees: on alpha, a "consume" ends up being > basically the same as a read barrier after the load, with only very > minimal room for optimization. > > (b) ARM and powerpc and similar architectures, that guarantee the > data dependency as long as it is an *actual* data dependency, and > never becomes a control dependency. > > On ARM and powerpc, control dependencies do *not* order accesses (the > reasons boil down to essentially: branch prediction breaks the > dependency, and instructions that come after the branch can be happily > executed before the branch). 
But it's almost impossible to describe > that in the standard, since compilers can (and very much do) turn a > control dependency into a data dependency and vice versa. > > So the current standard tries to describe that "control vs data" > dependency, and tries to limit it to a data dependency. It fails. It > fails for multiple reasons - it doesn't allow for trivial > optimizations that just remove the data dependency, and it also > doesn't allow for various trivial cases where the compiler *does* turn > the data dependency into a control dependency. > > So I really really think that the current C standard language is > broken. Unfixably broken. > > I'm trying to change the "syntactic data dependency" that the current > standard uses into something that is clearer and correct. > > The "chain of pointers" thing is still obviously a data dependency, > but by limiting it to pointers, it simplifies the language, clarifies > the meaning, avoids all syntactic tricks (ie "p-p" is clearly a > syntactic dependency on "p", but does *not* involve in any way > following the pointer) and makes it basically impossible for the > compiler to break the dependency without doing value prediction, and > since value prediction has to be disallowed anyway, that's a feature, > not a bug. AFAIU, Michael is wondering about how we can separate non-synchronizing code (ie, in this case, not taking part in any "chain of pointers" used with mo_consume loads) from code that does. If we cannot, then we prevent value prediction *everywhere*, unless the compiler can prove that the code is never part of such a chain (which is hard due to alias analysis being hard, etc.). (We can probably argue to which extent value prediction is necessary for generation of efficient code, but it obviously does work in non-synchronizing code (or even with acquire barriers with some care) -- so forbidding it entirely might be bad.) 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > This needs to be as follows: > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > { > > return i - i; > > } > > > > Otherwise dependencies won't get carried through it. > > C11 doesn't have attributes at all (and no specification regarding calls > and dependencies that I can see). And the way I read the C++11 > specification of carries_dependency is that specifying carries_dependency > is purely about increasing optimization of the caller: that if it isn't > specified, then the caller doesn't know what dependencies might be > carried. "Note: The carries_dependency attribute does not change the > meaning of the program, but may result in generation of more efficient > code. - end note". I think that this last sentence can be kind of misleading, especially when looking at it from an implementation point of view. How dependencies are handled (ie, preserving the syntactic dependencies vs. emitting barriers) must be part of the ABI, or things like [[carries_dependency]] won't work as expected (or lead to inefficient code). Thus, in practice, all compiler vendors on a platform would have to agree to a particular handling, which might end up in selecting the easy-but-conservative implementation option (ie, always emitting mo_acquire when the source uses mo_consume).
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > > Hi, > > > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > > dependency-generation part, and if I was a gcc person, seeing that > > > bugzilla entry Torvald pointed at, I would personally want to > > > dismember somebody with a rusty spoon.. > > > > Yes. Defining dependency chains in the way the standard currently seems > > to do must come from people not writing compilers. There's simply no > > sensible way to implement it without being really conservative, because > > the depchains can contain arbitrary constructs including stores, > > loads and function calls but must still be observed. > > > > And with conservative I mean "everything is a source of a dependency, and > > hence can't be removed, reordered or otherwise fiddled with", and that > > includes code sequences where no atomic objects are anywhere in sight [1]. > > In the light of that the only realistic way (meaning to not have to > > disable optimization everywhere) to implement consume as currently > > specified is to map it to acquire. At which point it becomes pointless. > > No, only memory_order_consume loads and [[carries_dependency]] > function arguments are sources of dependency chains. However, that is, given how the standard specifies things, just one of the possible ways for how an implementation can handle this. Treating [[carries_dependency]] as a "necessary" annotation to make exploiting mo_consume work in practice is possible, but it's not required by the standard. Also, dependencies are specified to flow through loads and stores (restricted to scalar objects and bitfields), so any load that might load from a dependency-carrying store can also be a source (and that doesn't seem to be restricted by [[carries_dependency]]). 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote: > On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: > > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > > > Hi, > > > > > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > > > dependency-generation part, and if I was a gcc person, seeing that > > > > bugzilla entry Torvald pointed at, I would personally want to > > > > dismember somebody with a rusty spoon.. > > > > > > Yes. Defining dependency chains in the way the standard currently seems > > > to do must come from people not writing compilers. There's simply no > > > sensible way to implement it without being really conservative, because > > > the depchains can contain arbitrary constructs including stores, > > > loads and function calls but must still be observed. > > > > > > And with conservative I mean "everything is a source of a dependency, and > > > hence can't be removed, reordered or otherwise fiddled with", and that > > > includes code sequences where no atomic objects are anywhere in sight [1]. > > > In the light of that the only realistic way (meaning to not have to > > > disable optimization everywhere) to implement consume as currently > > > specified is to map it to acquire. At which point it becomes pointless. > > > > No, only memory_order_consume loads and [[carries_dependency]] > > function arguments are sources of dependency chains. > > However, that is, given how the standard specifies things, just one of the > possible ways for how an implementation can handle this. Treating > [[carries_dependency]] as a "necessary" annotation to make exploiting > mo_consume work in practice is possible, but it's not required by the > standard. > > Also, dependencies are specified to flow through loads and stores > (restricted to scalar objects and bitfields), so any load that might load > from a dependency-carrying store can also be a source (and that doesn't > seem to be restricted by [[carries_dependency]]).

OK, this last is clearly totally unacceptable. :-/

Leaving aside the option of dropping the whole thing for the moment, the only thing that suggests itself is having all dependencies die at a specific point in the code, corresponding to the rcu_read_unlock(). But as far as I can see, that absolutely requires necessary parameter and return marking in order to correctly handle nested RCU read-side critical sections in different functions.

Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 26 Feb 2014, Torvald Riegel wrote: > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > This needs to be as follows: > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > { > > > return i - i; > > > } > > > > > > Otherwise dependencies won't get carried through it. > > > > C11 doesn't have attributes at all (and no specification regarding calls > > and dependencies that I can see). And the way I read the C++11 > > specification of carries_dependency is that specifying carries_dependency > > is purely about increasing optimization of the caller: that if it isn't > > specified, then the caller doesn't know what dependencies might be > > carried. "Note: The carries_dependency attribute does not change the > > meaning of the program, but may result in generation of more efficient > > code. - end note". > > I think that this last sentence can be kind of misleading, especially when > looking at it from an implementation point of view. How dependencies are > handled (ie, preserving the syntactic dependencies vs. emitting barriers) > must be part of the ABI, or things like [[carries_dependency]] won't work > as expected (or lead to inefficient code). Thus, in practice, all compiler > vendors on a platform would have to agree to a particular handling, which > might end up in selecting the easy-but-conservative implementation option > (ie, always emitting mo_acquire when the source uses mo_consume).

Regardless of the ABI, my point is that if a program is valid, it is also valid when all uses of [[carries_dependency]] are removed. If a function doesn't use [[carries_dependency]], that means dependencies may or may not be carried through this function. If a function uses [[carries_dependency]], that means that certain dependencies *are* carried through the function (and the ABI should then specify what this means the caller can rely on, in terms of the architecture's memory model). (This may or may not be useful, but it's how I understand C++11.)

-- Joseph S. Myers jos...@codesourcery.com
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-26 at 18:43 +, Joseph S. Myers wrote: > On Wed, 26 Feb 2014, Torvald Riegel wrote: > > On Fri, 2014-02-21 at 22:10 +, Joseph S. Myers wrote: > > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > This needs to be as follows: > > > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > > { > > > > return i - i; > > > > } > > > > > > > > Otherwise dependencies won't get carried through it. > > > > > > C11 doesn't have attributes at all (and no specification regarding calls > > > and dependencies that I can see). And the way I read the C++11 > > > specification of carries_dependency is that specifying carries_dependency > > > is purely about increasing optimization of the caller: that if it isn't > > > specified, then the caller doesn't know what dependencies might be > > > carried. "Note: The carries_dependency attribute does not change the > > > meaning of the program, but may result in generation of more efficient > > > code. - end note". > > > > I think that this last sentence can be kind of misleading, especially when > > looking at it from an implementation point of view. How dependencies are > > handled (ie, preserving the syntactic dependencies vs. emitting barriers) > > must be part of the ABI, or things like [[carries_dependency]] won't work > > as expected (or lead to inefficient code). Thus, in practice, all compiler > > vendors on a platform would have to agree to a particular handling, which > > might end up in selecting the easy-but-conservative implementation option > > (ie, always emitting mo_acquire when the source uses mo_consume). > > Regardless of the ABI, my point is that if a program is valid, it is also > valid when all uses of [[carries_dependency]] are removed. If a function > doesn't use [[carries_dependency]], that means dependencies may or may not > be carried through this function. If a function uses [[carries_dependency]], > that means that certain dependencies *are* carried through the function > (and the ABI should then specify what this means the caller can rely on, > in terms of the architecture's memory model). > (This may or may not be useful, but it's how I understand C++11.)

I agree. What I tried to point out is that it's not the case that an *implementation* can just ignore [[carries_dependency]]. So from an implementation perspective, the attribute does have semantics.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote: > On 02/25/14 17:15, Paul E. McKenney wrote: > >>I have for the last several years been 100% convinced that the Intel > >>memory ordering is the right thing, and that people who like weak > >>memory ordering are wrong and should try to avoid reproducing if at > >>all possible. But given that we have memory orderings like power and > >>ARM, I don't actually see a sane way to get a good strong ordering. > >>You can teach compilers about cases like the above when they actually > >>see all the code and they could poison the value chain etc. But it > >>would be fairly painful, and once you cross object files (or even just > >>functions in the same compilation unit, for that matter), it goes from > >>painful to just "ridiculously not worth it". > > > >And I have indeed seen a post or two from you favoring stronger memory > >ordering over the past few years. ;-) > I couldn't agree more. > > > > >Are ARM and Power really the bad boys here? Or are they instead playing > >the role of the canary in the coal mine? > That's a question I've been struggling with recently as well. I > suspect they (arm, power) are going to be the outliers rather than > the canary. While the weaker model may give them some advantages WRT > scalability, I don't think it'll ultimately be enough to overcome > the difficulty in writing correct low level code for them. > > Regardless, they're here and we have to deal with them. Agreed... Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote: > wrote: > > wrote: > >> I have for the last several years been 100% convinced that the Intel > >> memory ordering is the right thing, and that people who like weak > >> memory ordering are wrong and should try to avoid reproducing if at > >> all possible. > > > > Are ARM and Power really the bad boys here? Or are they instead playing > > the role of the canary in the coal mine? > > To paraphrase some older threads, I think Linus's argument is that > weak memory ordering is like branch delay slots: a way to make a simple > implementation simpler, but ends up being no help to a more aggressive > implementation. > > Branch delay slots give a one-cycle bonus to in-order cores, but > once you go superscalar and add branch prediction, they stop helping, > and once you go full out of order, they're just an annoyance. > > Likewise, I can see the point that weak ordering can help make a simple > cache interface simpler, but once you start doing speculative loads, > you've already bought and paid for all the hardware you need to do > stronger coherency. > > Another thing that requires all the strong-coherency machinery is > a high-performance implementation of the various memory barrier and > synchronization operations. Yes, a low-performance (drain the pipeline) > implementation is tolerable if the instructions aren't used frequently, > but once you're really trying, it doesn't save complexity. > > Once you're there, strong coherency always doesn't actually cost you any > time outside of critical synchronization code, and it both simplifies > and speeds up the tricky synchronization software. > > > So PPC and ARM's weak ordering are not the direction the future is going. > Rather, weak ordering is something that's only useful in a limited > technology window, which is rapidly passing. That does indeed appear to be Intel's story. Might well be correct. Time will tell. 
> If you can find someone in IBM who's worked on the Z series cache > coherency (extremely strong ordering), they probably have some useful > insights. The big question is if strong ordering, once you've accepted > the implementation complexity and area, actually costs anything in > execution time. If there's an unavoidable cost which weak ordering saves, > that's significant. There has been a lot of ink spilled on this argument. ;-) PPC has much larger CPU counts than does the mainframe. On the other hand, there are large x86 systems. Some claim that there are differences in latency due to the different approaches, and there could be a long argument about whether all this is inherent in the memory ordering or whether it is due to implementation issues. I don't claim to know the answer. I do know that ARM and PPC are here now, and that I need to deal with them. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney wrote:
> >
> > So let me see if I understand your reasoning.  My best guess is that it goes something like this:
> >
> > 1.	The Linux kernel contains code that passes pointers from rcu_dereference() through external functions.
>
> No, actually, it's not so much Linux-specific at all.
>
> I'm actually thinking about what I'd do as a compiler writer, and as a defender of the "C is a high-level assembler" concept.
>
> I love C. I'm a huge fan. I think it's a great language, and I think it's a great language not because of some theoretical issues, but because it is the only language around that actually maps fairly well to what machines really do.
>
> And it's a *simple* language. Sure, it's not quite as simple as it used to be, but look at how thin the K&R book is. Which pretty much describes it - still.
>
> That's the real strength of C, and why it's the only language serious people use for system programming.  Ignore C++ for a while (Jesus Xavier Christ, I've had to do C++ programming for subsurface), and just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990.  It was a lot smaller then.

> I can look at C code, and I can understand what the code generation is, and what it will really *do*. And I think that's important.  Abstractions that hide what the compiler will actually generate are bad abstractions.
>
> And ok, so this is obviously Linux-specific in that it's generally only Linux where I really care about the code generation, but I do think it's a bigger issue too.
>
> So I want C features to *map* to the hardware features they implement.  The abstractions should match each other, not fight each other.

OK...

> > Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap.  My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering.
>
> But if I can give two clear examples that are basically identical from a syntactic standpoint, and one clearly can be trivially optimized to the point where the ordering guarantee goes away, and the other cannot, and you cannot describe the difference, then I think your description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the ordering guarantee in either case.  Yes, I did notice that you find that unacceptable.

> And I do *not* think the C language should be defined by how it can be described.  Leave that to things like Haskell or LISP, where the goal is some kind of completeness of the language that is about the language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking in.  I don't know how to describe what the optimizers are and are not permitted to do strictly in terms of the underlying hardware.

> >> So the code sequence I already mentioned is *not* ordered:
> >>
> >> Litmus test 1:
> >>
> >>     p = atomic_read(pp, consume);
> >>     if (p == &variable)
> >>         return p->val;
> >>
> >> is *NOT* ordered, because the compiler can trivially turn this into "return variable.val", and break the data dependency.
> >
> > Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p->val.
>
> Yes. Note also that that is what existing compilers would actually do.
>
> And they'd do it "by mistake": they'd load the address of the variable into a register, and then compare the two registers, and then end up using _one_ of the registers as the base pointer for the "p->val" access, but I can almost *guarantee* that there are going to be sequences where some compiler will choose one register over the other based on some random detail.
>
> So my model isn't just a "model", it also happens to describe reality.

Sounds to me like your model -is- reality.  I believe that it is useful to constrain reality from time to time, but understand that you vehemently disagree.

> > Indeed, it won't work across different compilation units unless the compiler is told about it, which is of course the whole point of [[carries_dependency]].  Understood, though, the Linux kernel currently does not have anything that could reasonably automatically generate those [[carries_dependency]] attributes.  (Or are there other reasons why you believe [[carries_dependency]] is problematic?)
>
> So I think carries_dependency is problematic because:
>
>  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the other attributes that are not part of C11.

>  - it requires the programmer to solve the problem of the standard not matching the hardware.

The programmer in this instance being the compiler writer?
Re: [RFC][PATCH 0/5] arch: atomic rework
On 02/25/14 17:15, Paul E. McKenney wrote:
> > I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just "ridiculously not worth it".
>
> And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years.  ;-)

I couldn't agree more.

> Are ARM and Power really the bad boys here?  Or are they instead playing the role of the canary in the coal mine?

That's a question I've been struggling with recently as well.  I suspect they (arm, power) are going to be the outliers rather than the canary.  While the weaker model may give them some advantages WRT scalability, I don't think it'll ultimately be enough to overcome the difficulty in writing correct low level code for them.

Regardless, they're here and we have to deal with them.

Jeff
Re: [RFC][PATCH 0/5] arch: atomic rework
paul...@linux.vnet.ibm.com wrote:
> torva...@linux-foundation.org wrote:
>> I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible.
>
> Are ARM and Power really the bad boys here?  Or are they instead playing the role of the canary in the coal mine?

To paraphrase some older threads, I think Linus's argument is that weak memory ordering is like branch delay slots: a way to make a simple implementation simpler, but ends up being no help to a more aggressive implementation.

Branch delay slots give a one-cycle bonus to in-order cores, but once you go superscalar and add branch prediction, they stop helping, and once you go full out of order, they're just an annoyance.

Likewise, I can see the point that weak ordering can help make a simple cache interface simpler, but once you start doing speculative loads, you've already bought and paid for all the hardware you need to do stronger coherency.

Another thing that requires all the strong-coherency machinery is a high-performance implementation of the various memory barrier and synchronization operations.  Yes, a low-performance (drain the pipeline) implementation is tolerable if the instructions aren't used frequently, but once you're really trying, it doesn't save complexity.

Once you're there, strong coherency doesn't actually cost you any time outside of critical synchronization code, and it both simplifies and speeds up the tricky synchronization software.

So PPC and ARM's weak ordering is not the direction the future is going.  Rather, weak ordering is something that's only useful in a limited technology window, which is rapidly passing.

If you can find someone in IBM who's worked on the Z series cache coherency (extremely strong ordering), they probably have some useful insights.
The big question is if strong ordering, once you've accepted the implementation complexity and area, actually costs anything in execution time.  If there's an unavoidable cost which weak ordering saves, that's significant.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney wrote:
>
> So let me see if I understand your reasoning.  My best guess is that it goes something like this:
>
> 1.	The Linux kernel contains code that passes pointers from rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think it's a great language not because of some theoretical issues, but because it is the only language around that actually maps fairly well to what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it used to be, but look at how thin the K&R book is. Which pretty much describes it - still.

That's the real strength of C, and why it's the only language serious people use for system programming.  Ignore C++ for a while (Jesus Xavier Christ, I've had to do C++ programming for subsurface), and just think about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation is, and what it will really *do*. And I think that's important.  Abstractions that hide what the compiler will actually generate are bad abstractions.

And ok, so this is obviously Linux-specific in that it's generally only Linux where I really care about the code generation, but I do think it's a bigger issue too.

So I want C features to *map* to the hardware features they implement.  The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap.  My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering.

But if I can give two clear examples that are basically identical from a syntactic standpoint, and one clearly can be trivially optimized to the point where the ordering guarantee goes away, and the other cannot, and you cannot describe the difference, then I think your description is seriously lacking.

And I do *not* think the C language should be defined by how it can be described.  Leave that to things like Haskell or LISP, where the goal is some kind of completeness of the language that is about the language, not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>> is *NOT* ordered, because the compiler can trivially turn this into "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable into a register, and then compare the two registers, and then end up using _one_ of the registers as the base pointer for the "p->val" access, but I can almost *guarantee* that there are going to be sequences where some compiler will choose one register over the other based on some random detail.

So my model isn't just a "model", it also happens to describe reality.

> Indeed, it won't work across different compilation units unless the compiler is told about it, which is of course the whole point of [[carries_dependency]].  Understood, though, the Linux kernel currently does not have anything that could reasonably automatically generate those [[carries_dependency]] attributes.  (Or are there other reasons why you believe [[carries_dependency]] is problematic?)

So I think carries_dependency is problematic because:

 - it's not actually in C11 afaik

 - it requires the programmer to solve the problem of the standard not matching the hardware.

 - I think it's just insanely ugly, *especially* if it's actually meant to work so that the current carries-a-dependency works even for insane expressions like "a-a".

In practice, it's one of those things where I guess nobody actually would ever use it.

> Of course, I cannot resist putting forward a third litmus test:
>
>     static struct foo variable1;
>     static struct foo variable2;
>     static struct foo *pp = &variable1;
>
>     T1: initialize_foo(&variable2);
>         atomic_store_explicit(&pp, &variable2, memory_order_release);
>         /* The above is the only store to pp in this translation unit,
>          * and the address of pp is not exported in any way.
>          */
>
>     T2: if (p == &variable1)
>             return p->val1;  /* Must be variable1.val1. */
>         else
>             return p->val2;  /* Must be variable2.val2. */
>
> My guess is that your approach would not provide ordering in this case,
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds wrote:
> >
> > Litmus test 1:
> >
> >     p = atomic_read(pp, consume);
> >     if (p == &variable)
> >         return p->val;
> >
> > is *NOT* ordered
>
> Btw, don't get me wrong. I don't _like_ it not being ordered, and I actually did spend some time thinking about my earlier proposal on strengthening the 'consume' ordering.

Understood.

> I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just "ridiculously not worth it".

And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years.  ;-)

> So I think the C semantics should mirror what the hardware gives us - and do so even in the face of reasonable optimizations - not try to do something else that requires compilers to treat "consume" very differently.

I am sure that a great many people would jump for joy at the chance to drop any and all RCU-related verbiage from the C11 and C++11 standards.  (I know, you aren't necessarily advocating this, but given what you say above, I cannot think what verbiage would remain.)

The thing that makes me very nervous is how much the definition of "reasonable optimization" has changed.  For example, before the 2.6.10 Linux kernel, we didn't even apply volatile semantics to fetches of RCU-protected pointers -- and as far as I know, never needed to.  But since then, there have been several cases where the compiler happily hoisted a normal load out of a surprisingly large loop.

Hardware advances can come into play as well.  For example, my very first RCU work back in the early 90s was on a parallel system whose CPUs had no branch-prediction hardware (80386 or 80486, I don't remember which).  Now people talk about compilers using branch prediction hardware to implement value-speculation optimizations.  Five or ten years from now, who knows what crazy optimizations might be considered to be completely reasonable?

Are ARM and Power really the bad boys here?  Or are they instead playing the role of the canary in the coal mine?

> If people made me king of the world, I'd outlaw weak memory ordering. You can re-order as much as you want in hardware with speculation etc, but you should always *check* your speculation and make it *look* like you did everything in order. Which is pretty much the intel memory ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when first presenting RCU to the Digital UNIX folks long ago, I do have some sympathy with this line of thought.  But as you say, it is not the world we currently live in.

Of course, in the final analysis, your kernel, your call.

							Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds torva...@linux-foundation.org wrote: Litmus test 1: p = atomic_read(pp, consume); if (p == variable) return p-val; is *NOT* ordered Btw, don't get me wrong. I don't _like_ it not being ordered, and I actually did spend some time thinking about my earlier proposal on strengthening the 'consume' ordering. Understood. I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just ridiculously not worth it. And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years. ;-) So I think the C semantics should mirror what the hardware gives us - and do so even in the face of reasonable optimizations - not try to do something else that requires compilers to treat consume very differently. I am sure that a great many people would jump for joy at the chance to drop any and all RCU-related verbiage from the C11 and C++11 standards. (I know, you aren't necessarily advocating this, but given what you say above, I cannot think what verbiage that would remain.) The thing that makes me very nervous is how much the definition of reasonable optimization has changed. For example, before the 2.6.10 Linux kernel, we didn't even apply volatile semantics to fetches of RCU-protected pointers -- and as far as I know, never needed to. 
But since then, there have been several cases where the compiler happily hoisted a normal load out of a surprisingly large loop. Hardware advances can come into play as well. For example, my very first RCU work back in the early 90s was on a parallel system whose CPUs had no branch-prediction hardware (80386 or 80486, I don't remember which). Now people talk about compilers using branch prediction hardware to implement value-speculation optimizations. Five or ten years from now, who knows what crazy optimizations might be considered to be completely reasonable? Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine? If people made me king of the world, I'd outlaw weak memory ordering. You can re-order as much as you want in hardware with speculation etc, but you should always *check* your speculation and make it *look* like you did everything in order. Which is pretty much the intel memory ordering (ignoring the write buffering). Speaking as someone who got whacked over the head with DEC Alpha when first presenting RCU to the Digital UNIX folks long ago, I do have some sympathy with this line of thought. But as you say, it is not the world we currently live in. Of course, in the final analysis, your kernel, your call. Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: So let me see if I understand your reasoning. My best guess is that it goes something like this: 1. The Linux kernel contains code that passes pointers from rcu_dereference() through external functions. No, actually, it's not so much Linux-specific at all. I'm actually thinking about what I'd do as a compiler writer, and as a defender the C is a high-level assembler concept. I love C. I'm a huge fan. I think it's a great language, and I think it's a great language not because of some theoretical issues, but because it is the only language around that actually maps fairly well to what machines really do. And it's a *simple* language. Sure, it's not quite as simple as it used to be, but look at how thin the KR book is. Which pretty much describes it - still. That's the real strength of C, and why it's the only language serious people use for system programming. Ignore C++ for a while (Jesus Xavier Christ, I've had to do C++ programming for subsurface), and just think about what makes _C_ a good language. I can look at C code, and I can understand what the code generation is, and what it will really *do*. And I think that's important. Abstractions that hide what the compiler will actually generate are bad abstractions. And ok, so this is obviously Linux-specific in that it's generally only Linux where I really care about the code generation, but I do think it's a bigger issue too. So I want C features to *map* to the hardware features they implement. The abstractions should match each other, not fight each other. Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap. My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering. 
But if I can give two clear examples that are basically identical from a syntactic standpoint, and one clearly can be trivially optimized to the point where the ordering guarantee goes away, and the other cannot, and you cannot describe the difference, then I think your description is seriously lacking. And I do *not* think the C language should be defined by how it can be described. Leave that to things like Haskell or LISP, where the goal is some kind of completeness of the language that is about the language, not about the machines it will run on. So the code sequence I already mentioned is *not* ordered: Litmus test 1: p = atomic_read(pp, consume); if (p == variable) return p-val; is *NOT* ordered, because the compiler can trivially turn this into return variable.val, and break the data dependency. Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p-val. Yes. Note also that that is what existing compilers would actually do. And they'd do it by mistake: they'd load the address of the variable into a register, and then compare the two registers, and then end up using _one_ of the registers as the base pointer for the p-val access, but I can almost *guarantee* that there are going to be sequences where some compiler will choose one register over the other based on some random detail. So my model isn't just a model, it also happens to descibe reality. Indeed, it won't work across different compilation units unless the compiler is told about it, which is of course the whole point of [[carries_dependency]]. Understood, though, the Linux kernel currently does not have anything that could reasonably automatically generate those [[carries_dependency]] attributes. (Or are there other reasons why you believe [[carries_dependency]] is problematic?) 
So I think carries_dependency is problematic because: - it's not actually in C11 afaik - it requires the programmer to solve the problem of the standard not matching the hardware. - I think it's just insanely ugly, *especially* if it's actually meant to work so that the current carries-a-dependency works even for insane expressions like a-a. in practice, it's one of those things where I guess nobody actually would ever use it. Of course, I cannot resist putting forward a third litmus test: static struct foo variable1; static struct foo variable2; static struct foo *pp = variable1; T1: initialize_foo(variable2); atomic_store_explicit(pp, variable2, memory_order_release); /* The above is the only store to pp in this translation unit, * and the address of pp is not exported in any way. */ T2: if (p == variable1) return p-val1; /* Must be variable1.val1. */ else return p-val2; /* Must be variable2.val2. */ My guess is that your approach would not provide ordering in this case,
Re: [RFC][PATCH 0/5] arch: atomic rework
paul...@linux.vnet.ibm.com wrote: torva...@linux-foundation.org wrote: I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine? To paraphrase some older threads, I think Linus's argument is that weak memory ordering is like branch delay slots: a way to make a simple implementation simpler, but ends up being no help to a more aggressive implementation. Branch delay slots give a one-cycle bonus to in-order cores, but once you go superscalar and add branch prediction, they stop helping, and once you go full out of order, they're just an annoyance. Likewise, I can see the point that weak ordering can help make a simple cache interface simpler, but once you start doing speculative loads, you've already bought and paid for all the hardware you need to do stronger coherency. Another thing that requires all the strong-coherency machinery is a high-performance implementation of the various memory barrier and synchronization operations. Yes, a low-performance (drain the pipeline) implementation is tolerable if the instructions aren't used frequently, but once you're really trying, it doesn't save complexity. Once you're there, strong coherency always doesn't actually cost you any time outside of critical synchronization code, and it both simplifies and speeds up the tricky synchronization software. So PPC and ARM's weak ordering are not the direction the future is going. Rather, weak ordering is something that's only useful in a limited technology window, which is rapidly passing. If you can find someone in IBM who's worked on the Z series cache coherency (extremely strong ordering), they probably have some useful insights. 
The big question is if strong ordering, once you've accepted the implementation complexity and area, actually costs anything in execution time. If there's an unavoidable cost which weak ordering saves, that's significant. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On 02/25/14 17:15, Paul E. McKenney wrote: I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just ridiculously not worth it. And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years. ;-) I couldn't agree more. Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine? That's a question I've been struggling with recently as well. I suspect they (arm, power) are going to be the outliers rather than the canary. While the weaker model may give them some advantages WRT scalability, I don't think it'll ultimately be enough to overcome the difficulty in writing correct low level code for them. Regardless, they're here and we have to deal with them. Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: So let me see if I understand your reasoning. My best guess is that it goes something like this: 1. The Linux kernel contains code that passes pointers from rcu_dereference() through external functions. No, actually, it's not so much Linux-specific at all. I'm actually thinking about what I'd do as a compiler writer, and as a defender the C is a high-level assembler concept. I love C. I'm a huge fan. I think it's a great language, and I think it's a great language not because of some theoretical issues, but because it is the only language around that actually maps fairly well to what machines really do. And it's a *simple* language. Sure, it's not quite as simple as it used to be, but look at how thin the KR book is. Which pretty much describes it - still. That's the real strength of C, and why it's the only language serious people use for system programming. Ignore C++ for a while (Jesus Xavier Christ, I've had to do C++ programming for subsurface), and just think about what makes _C_ a good language. The last time I used C++ for a project was in 1990. It was a lot smaller then. I can look at C code, and I can understand what the code generation is, and what it will really *do*. And I think that's important. Abstractions that hide what the compiler will actually generate are bad abstractions. And ok, so this is obviously Linux-specific in that it's generally only Linux where I really care about the code generation, but I do think it's a bigger issue too. So I want C features to *map* to the hardware features they implement. The abstractions should match each other, not fight each other. OK... Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap. 
My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering. But if I can give two clear examples that are basically identical from a syntactic standpoint, and one clearly can be trivially optimized to the point where the ordering guarantee goes away, and the other cannot, and you cannot describe the difference, then I think your description is seriously lacking. In my defense, my plan was to constrain the compiler to retain the ordering guarantee in either case. Yes, I did notice that you find that unacceptable. And I do *not* think the C language should be defined by how it can be described. Leave that to things like Haskell or LISP, where the goal is some kind of completeness of the language that is about the language, not about the machines it will run on. I am with you up to the point that the fancy optimizers start kicking in. I don't know how to describe what the optimizers are and are not permitted to do strictly in terms of the underlying hardware. So the code sequence I already mentioned is *not* ordered: Litmus test 1: p = atomic_read(pp, consume); if (p == variable) return p-val; is *NOT* ordered, because the compiler can trivially turn this into return variable.val, and break the data dependency. Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p-val. Yes. Note also that that is what existing compilers would actually do. And they'd do it by mistake: they'd load the address of the variable into a register, and then compare the two registers, and then end up using _one_ of the registers as the base pointer for the p-val access, but I can almost *guarantee* that there are going to be sequences where some compiler will choose one register over the other based on some random detail. So my model isn't just a model, it also happens to descibe reality. 
Sounds to me like your model -is- reality. I believe that it is useful
to constrain reality from time to time, but understand that you
vehemently disagree.

> > Indeed, it won't work across different compilation units unless the
> > compiler is told about it, which is of course the whole point of
> > [[carries_dependency]]. Understood, though, the Linux kernel
> > currently does not have anything that could reasonably automatically
> > generate those [[carries_dependency]] attributes. (Or are there
> > other reasons why you believe [[carries_dependency]] is
> > problematic?)
>
> So I think carries_dependency is problematic because:
>
>  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

>  - it requires the programmer to solve the problem of the standard not
>    matching the hardware.

The programmer in this instance being the compiler writer?

>  - I
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
> paul...@linux.vnet.ibm.com wrote:
> > torva...@linux-foundation.org wrote:
> > > I have for the last several years been 100% convinced that the
> > > Intel memory ordering is the right thing, and that people who like
> > > weak memory ordering are wrong and should try to avoid reproducing
> > > if at all possible.
> >
> > Are ARM and Power really the bad boys here? Or are they instead
> > playing the role of the canary in the coal mine?
>
> To paraphrase some older threads, I think Linus's argument is that
> weak memory ordering is like branch delay slots: a way to make a
> simple implementation simpler, but ends up being no help to a more
> aggressive implementation.
>
> Branch delay slots give a one-cycle bonus to in-order cores, but once
> you go superscalar and add branch prediction, they stop helping, and
> once you go full out of order, they're just an annoyance.
>
> Likewise, I can see the point that weak ordering can help make a
> simple cache interface simpler, but once you start doing speculative
> loads, you've already bought and paid for all the hardware you need
> to do stronger coherency.

Another thing that requires all the strong-coherency machinery is a
high-performance implementation of the various memory-barrier and
synchronization operations. Yes, a low-performance (drain the
pipeline) implementation is tolerable if the instructions aren't used
frequently, but once you're really trying, it doesn't save complexity.

> Once you're there, strong coherency doesn't actually cost you any
> time outside of critical synchronization code, and it both simplifies
> and speeds up the tricky synchronization software.
>
> So PPC and ARM's weak ordering are not the direction the future is
> going. Rather, weak ordering is something that's only useful in a
> limited technology window, which is rapidly passing.

That does indeed appear to be Intel's story. Might well be correct.
Time will tell.
> If you can find someone in IBM who's worked on the Z series cache
> coherency (extremely strong ordering), they probably have some useful
> insights. The big question is if strong ordering, once you've
> accepted the implementation complexity and area, actually costs
> anything in execution time. If there's an unavoidable cost which weak
> ordering saves, that's significant.

There has been a lot of ink spilled on this argument. ;-)

PPC has much larger CPU counts than does the mainframe. On the other
hand, there are large x86 systems. Some claim that there are
differences in latency due to the different approaches, and there
could be a long argument about whether all this is inherent in the
memory ordering or whether it is due to implementation issues. I
don't claim to know the answer. I do know that ARM and PPC are here
now, and that I need to deal with them.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
> On 02/25/14 17:15, Paul E. McKenney wrote:
> > > I have for the last several years been 100% convinced that the
> > > Intel memory ordering is the right thing, and that people who like
> > > weak memory ordering are wrong and should try to avoid reproducing
> > > if at all possible. But given that we have memory orderings like
> > > power and ARM, I don't actually see a sane way to get a good
> > > strong ordering. You can teach compilers about cases like the
> > > above when they actually see all the code and they could poison
> > > the value chain etc. But it would be fairly painful, and once you
> > > cross object files (or even just functions in the same compilation
> > > unit, for that matter), it goes from painful to just ridiculously
> > > not worth it.
> >
> > And I have indeed seen a post or two from you favoring stronger
> > memory ordering over the past few years. ;-)
>
> I couldn't agree more.
>
> > Are ARM and Power really the bad boys here? Or are they instead
> > playing the role of the canary in the coal mine?
>
> That's a question I've been struggling with recently as well. I
> suspect they (arm, power) are going to be the outliers rather than
> the canary. While the weaker model may give them some advantages WRT
> scalability, I don't think it'll ultimately be enough to overcome the
> difficulty in writing correct low level code for them.
>
> Regardless, they're here and we have to deal with them.

Agreed...

Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds wrote:
>
> Litmus test 1:
>
>     p = atomic_read(pp, consume);
>     if (p == &variable)
>         return p->val;
>
> is *NOT* ordered

Btw, don't get me wrong. I don't _like_ it not being ordered, and I
actually did spend some time thinking about my earlier proposal on
strengthening the 'consume' ordering.

I have for the last several years been 100% convinced that the Intel
memory ordering is the right thing, and that people who like weak
memory ordering are wrong and should try to avoid reproducing if at
all possible. But given that we have memory orderings like power and
ARM, I don't actually see a sane way to get a good strong ordering.
You can teach compilers about cases like the above when they actually
see all the code and they could poison the value chain etc. But it
would be fairly painful, and once you cross object files (or even just
functions in the same compilation unit, for that matter), it goes from
painful to just "ridiculously not worth it".

So I think the C semantics should mirror what the hardware gives us -
and do so even in the face of reasonable optimizations - not try to do
something else that requires compilers to treat "consume" very
differently.

If people made me king of the world, I'd outlaw weak memory ordering.
You can re-order as much as you want in hardware with speculation etc,
but you should always *check* your speculation and make it *look* like
you did everything in order. Which is pretty much the intel memory
ordering (ignoring the write buffering).

Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 03:35:04PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney wrote:
> >>
> >> What if the "nothing modifies 'p'" part looks like this:
> >>
> >>     if (p != &myvariable)
> >>         return;
> >>
> >> and now any sane compiler will happily optimize "q = *p" into "q =
> >> myvariable", and we're all done - nothing invalid was ever
> >
> > Yes, the compiler could do that. But it would still be required to
> > carry a dependency from the memory_order_consume read to the "*p",
>
> But that's *BS*. You didn't actually listen to the main issue.
>
> Paul, why do you insist on this carries-a-dependency crap?

Sigh. Read on...

> It's broken. If you don't believe me, then believe the compiler person
> who already piped up and told you so.
>
> The "carries a dependency" model is broken. Get over it.
>
> No sane compiler will ever distinguish two different registers that
> have the same value from each other. No sane compiler will ever say
> "ok, register r1 has the exact same value as register r2, but r2
> carries the dependency, so I need to make sure to pass r2 to that
> function or use it as a base pointer".
>
> And nobody sane should *expect* a compiler to distinguish two
> registers with the same value that way.
>
> So the whole model is broken.
>
> I gave an alternate model (the "restrict"), and you didn't seem to
> understand the really fundamental difference. It's not a language
> difference, it's a conceptual difference.
>
> In the broken "carries a dependency" model, you have to fight all
> those aliases that can have the same value, and it is not a fight you
> can win. We've had the "p-p" examples, we've had the "p&0" examples,
> but the fact is, that "p==&myvariable" example IS EXACTLY THE SAME
> THING.
>
> All three of those things: "p-p", "p&0", and "p==&myvariable" mean
> that any compiler worth its salt now knows that "p" carries no
> information, and will optimize it away.
>
> So please stop arguing against that.
> Whenever you argue against that simple fact, you are arguing against
> sane compilers.

So let me see if I understand your reasoning. My best guess is that it
goes something like this:

1. The Linux kernel contains code that passes pointers from
   rcu_dereference() through external functions.

2. Code in the Linux kernel expects the normal RCU ordering guarantees
   to be in effect even when external functions are involved.

3. When compiling one of these external functions, the C compiler has
   no way of knowing about these RCU ordering guarantees.

4. The C compiler might therefore apply any and all optimizations to
   these external functions.

5. This in turn implies that the only way to prohibit any given
   optimization from being applied to the results obtained from
   rcu_dereference() is to prohibit that optimization globally.

6. We have to be very careful what optimizations are globally
   prohibited, because a poor choice could result in unacceptable
   performance degradation.

7. Therefore, the only operations that can be counted on to maintain
   the needed RCU orderings are those where the compiler really doesn't
   have any choice, in other words, where any reasonable way of
   computing the result will necessarily maintain the needed ordering.

Did I get this right, or am I confused?

> So *accept* the fact that some operations (and I guarantee that there
> are more of those than you can think of, and you can create them with
> various tricks using pretty much *any* feature in the C language)
> essentially take the data information away. And just accept the fact
> that then the ordering goes away too.

Actually, the fact that there are more potential optimizations than I
can think of is a big reason for my insistence on the
carries-a-dependency crap. My lack of optimization omniscience makes me
very nervous about relying on there never ever being a reasonable way
of computing a given result without preserving the ordering.

> So give up on "carries a dependency".
> Because there will be cases where that dependency *isn't* carried.
>
> The language of the standard needs to get *away* from the broken
> model, because otherwise the standard is broken.
>
> I suggest we instead talk about "litmus tests" and why certain code
> sequences are ordered, and others are not.

OK...

> So the code sequence I already mentioned is *not* ordered:
>
> Litmus test 1:
>
>     p = atomic_read(pp, consume);
>     if (p == &variable)
>         return p->val;
>
> is *NOT* ordered, because the compiler can trivially turn this into
> "return variable.val", and break the data dependency.

Right, given your model, the compiler is free to produce code that
doesn't order the load from pp against the load from p->val.

> This is true *regardless* of any "carries a dependency" language,
> because that language is insane, and doesn't work when the