Re: Linux-kernel examples for LKMM recipes
On Wed, Oct 18, 2017 at 05:18:37PM -0400, Alan Stern wrote:
> On Wed, 18 Oct 2017, Paul E. McKenney wrote:
> > > Well, you could explicitly mention that in the multi-thread case, this
> > > means all accesses to the shared variable had better use READ_ONCE() or
> > > WRITE_ONCE().
> > 
> > Like this?
> > 
> > 							Thanx, Paul
> > 
> > 	d.	If there are multiple CPUs, accesses to shared variables
> > 		should use READ_ONCE() and WRITE_ONCE() or stronger
> > 		to prevent load/store tearing, load/store fusing, and
> > 		invented loads and stores.  There are exceptions to
> > 		this rule, for example:
> > 
> > 		i.	When there is no possibility of a given
> > 			shared variable being updated, for example,
> > 			while holding the update-side lock, reads
> > 			from that variable need not use READ_ONCE().
> > 
> > 		ii.	When there is no possibility of a given shared
> > 			variable being either read or updated, for
> > 			example, when running during early boot, reads
> > 			from that variable need not use READ_ONCE() and
> > 			writes to that variable need not use WRITE_ONCE().
> 
> Yeah, except that you mean being read or updated by another thread.

Good point, added that qualifier.

							Thanx, Paul
Re: Linux-kernel examples for LKMM recipes
On Wed, 18 Oct 2017, Paul E. McKenney wrote:
> > Well, you could explicitly mention that in the multi-thread case, this
> > means all accesses to the shared variable had better use READ_ONCE() or
> > WRITE_ONCE().
> 
> Like this?
> 
> 							Thanx, Paul
> 
> 	d.	If there are multiple CPUs, accesses to shared variables
> 		should use READ_ONCE() and WRITE_ONCE() or stronger
> 		to prevent load/store tearing, load/store fusing, and
> 		invented loads and stores.  There are exceptions to
> 		this rule, for example:
> 
> 		i.	When there is no possibility of a given
> 			shared variable being updated, for example,
> 			while holding the update-side lock, reads
> 			from that variable need not use READ_ONCE().
> 
> 		ii.	When there is no possibility of a given shared
> 			variable being either read or updated, for
> 			example, when running during early boot, reads
> 			from that variable need not use READ_ONCE() and
> 			writes to that variable need not use WRITE_ONCE().

Yeah, except that you mean being read or updated by another thread.

Alan
Re: Linux-kernel examples for LKMM recipes
On Wed, Oct 18, 2017 at 10:43:42AM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> 
> > > > > > 	b.	Compilers are permitted to use the "as-if" rule.
> > > > > > 		That is, a compiler can emit whatever code it likes,
> > > > > > 		as long as the results appear just as if the compiler
> > > > > > 		had followed all the relevant rules.  To see this,
> > > > > > 		compiler with a high level of optimization and run
> > > > > > 		the debugger on the resulting binary.
> > > > > 
> > > > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > > > if it had executed in order (even for a single variable!), and if you
> > > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > > it likes.
> > > > 
> > > > Ah, I omitted an important qualifier:
> > > > 
> > > > 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> > > > 		a compiler can emit whatever code it likes, as long as
> > > > 		the results of a single-threaded execution appear just
> > > > 		as if the compiler had followed all the relevant rules.
> > > > 		To see this, compile with a high level of optimization
> > > > 		and run the debugger on the resulting binary.
> > > 
> > > That's okay for the single-CPU case.  I don't think it covers the
> > > multiple-CPU single-variable case correctly, though.  If you don't use
> > > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > > and stores?  And won't that potentially cause the end result to be
> > > different from what you would get if the code had appeared to execute
> > > in order?
> > 
> > Ah, good point, I need yet another qualifier.  How about the following?
> > 
> > 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> > 		a compiler can emit whatever code it likes for normal
> > 		accesses, as long as the results of a single-threaded
> > 		execution appear just as if the compiler had followed
> > 		all the relevant rules.  To see this, compile with a
> > 		high level of optimization and run the debugger on the
> > 		resulting binary.
> > 
> > I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> > and atomics.  This, in conjunction with the previously added
> > "single-threaded execution" means that yes, the compiler is permitted
> > to tear normal loads and stores.  The reason is that a single-threaded
> > run could not tell the difference.  Interrupt handlers or multiple
> > threads are required to detect load/store tearing.
> > 
> > So, what am I still missing?  ;-)
> 
> Well, you could explicitly mention that in the multi-thread case, this
> means all accesses to the shared variable had better use READ_ONCE() or
> WRITE_ONCE().

Like this?

							Thanx, Paul

	d.	If there are multiple CPUs, accesses to shared variables
		should use READ_ONCE() and WRITE_ONCE() or stronger
		to prevent load/store tearing, load/store fusing, and
		invented loads and stores.  There are exceptions to
		this rule, for example:

		i.	When there is no possibility of a given
			shared variable being updated, for example,
			while holding the update-side lock, reads
			from that variable need not use READ_ONCE().

		ii.	When there is no possibility of a given shared
			variable being either read or updated, for
			example, when running during early boot, reads
			from that variable need not use READ_ONCE() and
			writes to that variable need not use WRITE_ONCE().
Re: Linux-kernel examples for LKMM recipes
On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > > > > 	b.	Compilers are permitted to use the "as-if" rule.
> > > > > 		That is, a compiler can emit whatever code it likes,
> > > > > 		as long as the results appear just as if the compiler
> > > > > 		had followed all the relevant rules.  To see this,
> > > > > 		compiler with a high level of optimization and run
> > > > > 		the debugger on the resulting binary.
> > > > 
> > > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > > if it had executed in order (even for a single variable!), and if you
> > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > it likes.
> > > 
> > > Ah, I omitted an important qualifier:
> > > 
> > > 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> > > 		a compiler can emit whatever code it likes, as long as
> > > 		the results of a single-threaded execution appear just
> > > 		as if the compiler had followed all the relevant rules.
> > > 		To see this, compile with a high level of optimization
> > > 		and run the debugger on the resulting binary.
> > 
> > That's okay for the single-CPU case.  I don't think it covers the
> > multiple-CPU single-variable case correctly, though.  If you don't use
> > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > and stores?  And won't that potentially cause the end result to be
> > different from what you would get if the code had appeared to execute
> > in order?
> 
> Ah, good point, I need yet another qualifier.  How about the following?
> 
> 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> 		a compiler can emit whatever code it likes for normal
> 		accesses, as long as the results of a single-threaded
> 		execution appear just as if the compiler had followed
> 		all the relevant rules.  To see this, compile with a
> 		high level of optimization and run the debugger on the
> 		resulting binary.
> 
> I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> and atomics.  This, in conjunction with the previously added
> "single-threaded execution" means that yes, the compiler is permitted
> to tear normal loads and stores.  The reason is that a single-threaded
> run could not tell the difference.  Interrupt handlers or multiple
> threads are required to detect load/store tearing.
> 
> So, what am I still missing?  ;-)

Well, you could explicitly mention that in the multi-thread case, this
means all accesses to the shared variable had better use READ_ONCE() or
WRITE_ONCE().

Alan
Re: Linux-kernel examples for LKMM recipes
On Tue, Oct 17, 2017 at 05:03:01PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > > 
> > > > How about this?
> > > > 
> > > > 0.	Simple special cases
> > > > 
> > > > 	If there is only one CPU on the one hand or only one variable
> > > > 	on the other, the code will execute in order.  There are (as
> > > > 	usual) some things to be careful of:
> > > > 
> > > > 	a.	There are some aspects of the C language that are
> > > > 		unordered.  For example, the compiler can output code
> > > > 		computing arguments of a multi-parameter function in
> > > > 		any order it likes, or even interleaved if it so
> > > > 		chooses.
> > > 
> > > That parses a little oddly.  I wouldn't agree that the compiler outputs
> > > the code in any order it likes!
> > 
> > When was the last time you talked to a compiler writer?  ;-)
> > 
> > > In fact, I wouldn't even mention the compiler at all.  Just say that
> > > (with a few exceptions) the language doesn't specify the order in which
> > > the arguments of a function or operation should be evaluated.  For
> > > example, in the expression "f(x) + g(y)", the order in which f and g
> > > are called is not defined; the object code is allowed to use either
> > > order or even to interleave the computations.
> > 
> > Nevertheless, I took your suggestion:
> > 
> > 	a.	There are some aspects of the C language that are
> > 		unordered.  For example, in the expression "f(x) + g(y)",
> > 		the order in which f and g are called is not defined;
> > 		the object code is allowed to use either order or even
> > 		to interleave the computations.
> 
> Good.
> 
> > > > 	b.	Compilers are permitted to use the "as-if" rule.
> > > > 		That is, a compiler can emit whatever code it likes,
> > > > 		as long as the results appear just as if the compiler
> > > > 		had followed all the relevant rules.  To see this,
> > > > 		compiler with a high level of optimization and run
> > > > 		the debugger on the resulting binary.
> > > 
> > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > if it had executed in order (even for a single variable!), and if you
> > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > it likes.
> > 
> > Ah, I omitted an important qualifier:
> > 
> > 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> > 		a compiler can emit whatever code it likes, as long as
> > 		the results of a single-threaded execution appear just
> > 		as if the compiler had followed all the relevant rules.
> > 		To see this, compile with a high level of optimization
> > 		and run the debugger on the resulting binary.
> 
> That's okay for the single-CPU case.  I don't think it covers the
> multiple-CPU single-variable case correctly, though.  If you don't use
> READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> and stores?  And won't that potentially cause the end result to be
> different from what you would get if the code had appeared to execute
> in order?

Ah, good point, I need yet another qualifier.  How about the following?

	b.	Compilers are permitted to use the "as-if" rule.  That is,
		a compiler can emit whatever code it likes for normal
		accesses, as long as the results of a single-threaded
		execution appear just as if the compiler had followed
		all the relevant rules.  To see this, compile with a
		high level of optimization and run the debugger on the
		resulting binary.

I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
and atomics.  This, in conjunction with the previously added
"single-threaded execution" means that yes, the compiler is permitted
to tear normal loads and stores.  The reason is that a single-threaded
run could not tell the difference.  Interrupt handlers or multiple
threads are required to detect load/store tearing.

So, what am I still missing?  ;-)

> > I have seen people (including kernel hackers) surprised by what optimizers
> > do, so I would prefer that the last sentence remain.
> > 
> > > > 	c.	If there is only one variable but multiple CPUs, all
> > > > 		accesses to that variable must be aligned and full
> > > > 		sized.
> > > 
> > > I would say that the variable is what needs to be aligned, not the
> > > accesses.  (Although, if the variable is aligned and all the accesses
> > > are full sized, then they must necessarily be aligned as well.)
> > 
> > I was
Re: Linux-kernel examples for LKMM recipes
On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > 
> > > How about this?
> > > 
> > > 0.	Simple special cases
> > > 
> > > 	If there is only one CPU on the one hand or only one variable
> > > 	on the other, the code will execute in order.  There are (as
> > > 	usual) some things to be careful of:
> > > 
> > > 	a.	There are some aspects of the C language that are
> > > 		unordered.  For example, the compiler can output code
> > > 		computing arguments of a multi-parameter function in
> > > 		any order it likes, or even interleaved if it so chooses.
> > 
> > That parses a little oddly.  I wouldn't agree that the compiler outputs
> > the code in any order it likes!
> 
> When was the last time you talked to a compiler writer?  ;-)
> 
> > In fact, I wouldn't even mention the compiler at all.  Just say that
> > (with a few exceptions) the language doesn't specify the order in which
> > the arguments of a function or operation should be evaluated.  For
> > example, in the expression "f(x) + g(y)", the order in which f and g
> > are called is not defined; the object code is allowed to use either
> > order or even to interleave the computations.
> 
> Nevertheless, I took your suggestion:
> 
> 	a.	There are some aspects of the C language that are
> 		unordered.  For example, in the expression "f(x) + g(y)",
> 		the order in which f and g are called is not defined;
> 		the object code is allowed to use either order or even
> 		to interleave the computations.

Good.

> > > 	b.	Compilers are permitted to use the "as-if" rule.
> > > 		That is, a compiler can emit whatever code it likes,
> > > 		as long as the results appear just as if the compiler
> > > 		had followed all the relevant rules.  To see this,
> > > 		compiler with a high level of optimization and run
> > > 		the debugger on the resulting binary.
> > 
> > You might omit the last sentence.  Furthermore, if the accesses don't
> > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > if it had executed in order (even for a single variable!), and if you
> > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > it likes.
> 
> Ah, I omitted an important qualifier:
> 
> 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> 		a compiler can emit whatever code it likes, as long as
> 		the results of a single-threaded execution appear just
> 		as if the compiler had followed all the relevant rules.
> 		To see this, compile with a high level of optimization
> 		and run the debugger on the resulting binary.

That's okay for the single-CPU case.  I don't think it covers the
multiple-CPU single-variable case correctly, though.  If you don't use
READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
and stores?  And won't that potentially cause the end result to be
different from what you would get if the code had appeared to execute
in order?

> I have seen people (including kernel hackers) surprised by what optimizers
> do, so I would prefer that the last sentence remain.
> 
> > > 	c.	If there is only one variable but multiple CPUs, all
> > > 		accesses to that variable must be aligned and full sized.
> > 
> > I would say that the variable is what needs to be aligned, not the
> > accesses.  (Although, if the variable is aligned and all the accesses
> > are full sized, then they must necessarily be aligned as well.)
> 
> I was thinking in terms of an unaligned 16-bit access to a 32-bit
> variable.  That wouldn't be full sized.
> 
> But how about this?
> 
> 	c.	If there is only one variable but multiple CPUs, all

Extra "all".  Otherwise okay.

> 		that variable must be properly aligned and all accesses
> 		to that variable must be full sized.
> 
> > > 		Variables that straddle cachelines or pages void your
> > > 		full-ordering warranty, as do undersized accesses that
> > > 		load from or store to only part of the variable.
> > 
> > How can a variable straddle pages without also straddling cache lines?
> 
> Well, a variable -can- straddle cachelines without straddling pages,
> which justifies the "or".  Furthermore, given that cacheline sizes have
> been growing, but pages are still 4KB, it is probably only a matter
> of time.  ;-)

By that time, we'll probably be using 64-KB pages.  Or even bigger!

Alan
Re: Linux-kernel examples for LKMM recipes
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote: > On Wed, Oct 11, 2017 at 10:32:30PM +, Paul E. McKenney wrote: > > Hello! > > > > At Linux Plumbers Conference, we got requests for a recipes document, > > and a further request to point to actual code in the Linux kernel. > > I have pulled together some examples for various litmus-test families, > > as shown below. The decoder ring for the abbreviations (ISA2, LB, SB, > > MP, ...) is here: > > > > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf > > > > This document is also checked into the memory-models git archive: > > > > https://github.com/aparri/memory-model.git > > > > I would be especially interested in simpler examples in general, and > > of course any example at all for the cases where I was unable to find > > any. Thoughts? > > > > Thanx, Paul > > > > > > > > This document lists the litmus-test patterns that we have been discussing, > > along with examples from the Linux kernel. This is intended to feed into > > the recipes document. All examples are from v4.13. > > > > 0. Single-variable SC. > > > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > > (see kernel/rcu/tree.c). The counter is accessed by > > interrupts and NMIs as well as by process-level code. > > This counter can be accessed by other CPUs, but only > > for debug output. > > > > b. Between CPUs, I would put forward the ->dflags > > updates, but this is anything but simple. But maybe > > OK for an illustration? > > > > 1. MP (see test6.pdf for nickname translation) > > > > a. smp_store_release() / smp_load_acquire() > > > > init_stack_slab() in lib/stackdepot.c uses release-acquire > > to handle initialization of a slab of the stack. Working > > out the mutual-exclusion design is left as an exercise for > > the reader. > > > > b. 
rcu_assign_pointer() / rcu_dereference() > > > > expand_to_next_prime() does the rcu_assign_pointer(), > > and next_prime_number() does the rcu_dereference(). > > This mediates access to a bit vector that is expanded > > as additional primes are needed. These two functions > > are in lib/prime_numbers.c. > > > > c. smp_wmb() / smp_rmb() > > > > xlog_state_switch_iclogs() contains the following: > > > > log->l_curr_block -= log->l_logBBsize; > > ASSERT(log->l_curr_block >= 0); > > smp_wmb(); > > log->l_curr_cycle++; > > > > And xlog_valid_lsn() contains the following: > > > > cur_cycle = ACCESS_ONCE(log->l_curr_cycle); > > smp_rmb(); > > cur_block = ACCESS_ONCE(log->l_curr_block); > > > > d. Replacing either of the above with smp_mb() > > > > Holding off on this one for the moment... > > > > 2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB > > > > Lots of variety here, can in some cases substitute: > > > > a. READ_ONCE() for smp_load_acquire() > > b. WRITE_ONCE() for smp_store_release() > > c. Dependencies for both smp_load_acquire() and > > smp_store_release(). > > d. smp_wmb() for smp_store_release() in first thread > > of ISA2 and Z6.2. > > e. smp_rmb() for smp_load_acquire() in last thread of ISA2. > > > > The canonical illustration of LB involves the various memory > > allocators, where you don't want a load from about-to-be-freed > > memory to see a store initializing a later incarnation of that > > same memory area. But the per-CPU caches make this a very > > long and complicated example. > > > > I am not aware of any three-CPU release-acquire chains in the > > Linux kernel. There are three-CPU lock-based chains in RCU, > > but these are not at all simple, either. > > > > The "Program-Order guarantees" case in scheduler? 
See the comments > written by Peter above try_to_wake_up(): > > * The basic program-order guarantee on SMP systems is that when a task [t] > * migrates, all its activity on its old CPU [c0] happens-before any > subsequent > * execution on its new CPU [c1]. > ... > * For blocking we (obviously) need to provide the same guarantee as for > * migration. However the means are completely different as there is no lock > * chain to provide order. Instead we do: > * > * 1) smp_store_release(X->on_cpu, 0) > * 2) smp_cond_load_acquire(!X->on_cpu) > * > * Example: > * > * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
Re: Linux-kernel examples for LKMM recipes
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote: > On Wed, Oct 11, 2017 at 10:32:30PM +, Paul E. McKenney wrote: > > Hello! > > > > At Linux Plumbers Conference, we got requests for a recipes document, > > and a further request to point to actual code in the Linux kernel. > > I have pulled together some examples for various litmus-test families, > > as shown below. The decoder ring for the abbreviations (ISA2, LB, SB, > > MP, ...) is here: > > > > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf > > > > This document is also checked into the memory-models git archive: > > > > https://github.com/aparri/memory-model.git > > > > I would be especially interested in simpler examples in general, and > > of course any example at all for the cases where I was unable to find > > any. Thoughts? > > > > Thanx, Paul > > > > > > > > This document lists the litmus-test patterns that we have been discussing, > > along with examples from the Linux kernel. This is intended to feed into > > the recipes document. All examples are from v4.13. > > > > 0. Single-variable SC. > > > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > > (see kernel/rcu/tree.c). The counter is accessed by > > interrupts and NMIs as well as by process-level code. > > This counter can be accessed by other CPUs, but only > > for debug output. > > > > b. Between CPUs, I would put forward the ->dflags > > updates, but this is anything but simple. But maybe > > OK for an illustration? > > > > 1. MP (see test6.pdf for nickname translation) > > > > a. smp_store_release() / smp_load_acquire() > > > > init_stack_slab() in lib/stackdepot.c uses release-acquire > > to handle initialization of a slab of the stack. Working > > out the mutual-exclusion design is left as an exercise for > > the reader. > > > > b. 
rcu_assign_pointer() / rcu_dereference() > > > > expand_to_next_prime() does the rcu_assign_pointer(), > > and next_prime_number() does the rcu_dereference(). > > This mediates access to a bit vector that is expanded > > as additional primes are needed. These two functions > > are in lib/prime_numbers.c. > > > > c. smp_wmb() / smp_rmb() > > > > xlog_state_switch_iclogs() contains the following: > > > > log->l_curr_block -= log->l_logBBsize; > > ASSERT(log->l_curr_block >= 0); > > smp_wmb(); > > log->l_curr_cycle++; > > > > And xlog_valid_lsn() contains the following: > > > > cur_cycle = ACCESS_ONCE(log->l_curr_cycle); > > smp_rmb(); > > cur_block = ACCESS_ONCE(log->l_curr_block); > > > > d. Replacing either of the above with smp_mb() > > > > Holding off on this one for the moment... > > > > 2. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB > > > > Lots of variety here, can in some cases substitute: > > > > a. READ_ONCE() for smp_load_acquire() > > b. WRITE_ONCE() for smp_store_release() > > c. Dependencies for both smp_load_acquire() and > > smp_store_release(). > > d. smp_wmb() for smp_store_release() in first thread > > of ISA2 and Z6.2. > > e. smp_rmb() for smp_load_acquire() in last thread of ISA2. > > > > The canonical illustration of LB involves the various memory > > allocators, where you don't want a load from about-to-be-freed > > memory to see a store initializing a later incarnation of that > > same memory area. But the per-CPU caches make this a very > > long and complicated example. > > > > I am not aware of any three-CPU release-acquire chains in the > > Linux kernel. There are three-CPU lock-based chains in RCU, > > but these are not at all simple, either. > > > > The "Program-Order guarantees" case in scheduler? 
See the comments > written by Peter above try_to_wake_up(): > > * The basic program-order guarantee on SMP systems is that when a task [t] > * migrates, all its activity on its old CPU [c0] happens-before any > subsequent > * execution on its new CPU [c1]. > ... > * For blocking we (obviously) need to provide the same guarantee as for > * migration. However the means are completely different as there is no lock > * chain to provide order. Instead we do: > * > * 1) smp_store_release(X->on_cpu, 0) > * 2) smp_cond_load_acquire(!X->on_cpu) > * > * Example: > * > * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule)
Re: Linux-kernel examples for LKMM recipes
On Thu, Oct 12, 2017 at 12:27:19PM +0100, Will Deacon wrote: > On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote: > > On Wed, Oct 11, 2017 at 10:32:30PM +, Paul E. McKenney wrote: > > > I am not aware of any three-CPU release-acquire chains in the > > > Linux kernel. There are three-CPU lock-based chains in RCU, > > > but these are not at all simple, either. > > > > > > > The "Program-Order guarantees" case in scheduler? See the comments > > written by Peter above try_to_wake_up(): > > > > * The basic program-order guarantee on SMP systems is that when a task [t] > > * migrates, all its activity on its old CPU [c0] happens-before any > > subsequent > > * execution on its new CPU [c1]. > > ... > > * For blocking we (obviously) need to provide the same guarantee as for > > * migration. However the means are completely different as there is no lock > > * chain to provide order. Instead we do: > > * > > * 1) smp_store_release(X->on_cpu, 0) > > * 2) smp_cond_load_acquire(!X->on_cpu) > > * > > * Example: > > * > > * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule) > > * > > * LOCK rq(0)->lock LOCK X->pi_lock > > * dequeue X > > * sched-out X > > * smp_store_release(X->on_cpu, 0); > > * > > *smp_cond_load_acquire(&X->on_cpu, !VAL); > > *X->state = WAKING > > *set_task_cpu(X,2) > > * > > *LOCK rq(2)->lock > > *enqueue X > > *X->state = RUNNING > > *UNLOCK rq(2)->lock > > * > > * LOCK rq(2)->lock // orders > > against CPU1 > > * sched-out Z > > * sched-in X > > * UNLOCK rq(2)->lock > > * > > *UNLOCK X->pi_lock > > * UNLOCK rq(0)->lock > > > > This is a chain mixed with lock and acquire-release(maybe even better?). > > > > > > And another example would be osq_{lock,unlock}() on multiple(more than > > three) CPUs. 
> > I think the qrwlock also has something similar with the writer fairness > issue fixed: > > CPU0: (writer doing an unlock) > smp_store_release(&lock->wlocked, 0); // Bottom byte of lock->cnts > > > CPU1: (waiting writer on slowpath) > atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING); > ... > arch_spin_unlock(&lock->wait_lock); > > > CPU2: (reader on slowpath) > arch_spin_lock(&lock->wait_lock); > > and there's mixed-size accesses here too. Fun stuff! You had me going there until you mentioned the mixed-size accesses. ;-) Thanx, Paul
Re: Linux-kernel examples for LKMM recipes
On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote: > On Tue, 17 Oct 2017, Paul E. McKenney wrote: > > > How about this? > > > > 0. Simple special cases > > > > If there is only one CPU on the one hand or only one variable > > on the other, the code will execute in order. There are (as > > usual) some things to be careful of: > > > > a. There are some aspects of the C language that are > > unordered. For example, the compiler can output code > > computing arguments of a multi-parameter function in > > any order it likes, or even interleaved if it so chooses. > > That parses a little oddly. I wouldn't agree that the compiler outputs > the code in any order it likes! When was the last time you talked to a compiler writer? ;-) > In fact, I wouldn't even mention the compiler at all. Just say that > (with a few exceptions) the language doesn't specify the order in which > the arguments of a function or operation should be evaluated. For > example, in the expression "f(x) + g(y)", the order in which f and g > are called is not defined; the object code is allowed to use either > order or even to interleave the computations. Nevertheless, I took your suggestion: a. There are some aspects of the C language that are unordered. For example, in the expression "f(x) + g(y)", the order in which f and g are called is not defined; the object code is allowed to use either order or even to interleave the computations. > > b. Compilers are permitted to use the "as-if" rule. > > That is, a compiler can emit whatever code it likes, > > as long as the results appear just as if the compiler > > had followed all the relevant rules. To see this, > > compiler with a high level of optimization and run > > the debugger on the resulting binary. > > You might omit the last sentence. 
Furthermore, if the accesses don't > use READ_ONCE/WRITE_ONCE then the code might not get the same result as > if it had executed in order (even for a single variable!), and if you > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code > it likes. Ah, I omitted an important qualifier: b. Compilers are permitted to use the "as-if" rule. That is, a compiler can emit whatever code it likes, as long as the results of a single-threaded execution appear just as if the compiler had followed all the relevant rules. To see this, compile with a high level of optimization and run the debugger on the resulting binary. I have seen people (including kernel hackers) surprised by what optimizers do, so I would prefer that the last sentence remain. > > c. If there is only one variable but multiple CPUs, all > > accesses to that variable must be aligned and full sized. > > I would say that the variable is what needs to be aligned, not the > accesses. (Although, if the variable is aligned and all the accesses > are full sized, then they must necessarily be aligned as well.) I was thinking in terms of an unaligned 16-bit access to a 32-bit variable. But how about this? c. If there is only one variable but multiple CPUs, all that variable must be properly aligned and all accesses to that variable must be full sized. > > Variables that straddle cachelines or pages void your > > full-ordering warranty, as do undersized accesses that > > load from or store to only part of the variable. > > How can a variable straddle pages without also straddling cache lines? Well, a variable -can- straddle cachelines without straddling pages, which justifies the "or". Furthermore, given that cacheline sizes have been growing, but pages are still 4KB, it is probably only a matter of time. ;-) Thanx, Paul
Re: Linux-kernel examples for LKMM recipes
On Thu, Oct 12, 2017 at 03:27:44PM +0200, Andrea Parri wrote: > Hi Paul, > > On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote: > > Hello! > > > > At Linux Plumbers Conference, we got requests for a recipes document, > > and a further request to point to actual code in the Linux kernel. > > I have pulled together some examples for various litmus-test families, > > as shown below. The decoder ring for the abbreviations (ISA2, LB, SB, > > MP, ...) is here: > > > > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf > > > > This document is also checked into the memory-models git archive: > > > > https://github.com/aparri/memory-model.git > > > > I would be especially interested in simpler examples in general, and > > of course any example at all for the cases where I was unable to find > > any. Thoughts? > > Below are some examples we did discuss (at some point): > > The comment in kernel/events/ring_buffer.c:perf_output_put_handle() > describes instances of MP+wmb+rmb and LB+ctrl+mb. I added this as an alternative for MP and as the example for LB. > The comments in kernel/sched/core.c:try_to_wake_up() describes more > instances of MP ("plus locking") and LB (see finish_lock_switch()). This one looks a bit more messy, so I will set it aside, for the moment, anyway. > The comment in kernel/sched/core.c:task_rq_lock() describes an ins- > tance of MP+wmb+addr-acqpo. This one also looks a bit messy, so I am setting it aside as well. > The comment in include/linux/wait.h:waitqueue_active() describes an > instance of SB+mb+mb. Very good, I took this as the generic pattern for the current pair of SB examples. > 63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug") > describes an instance of W+RWC+porel+mb+mb. Well, this one certainly is of historical interest. After all, it might well be the first Linux-kernel commit containing a litmus test. ;-) I put it in recipes-LKcode-63cae12bce986.txt and reference it from recipes-LKcode.txt. > [...] 
> > I wish we could say "any barrier (explicit or implicit) in sources > is accompanied by a comment mentioning the interested pattern...", > but life is not always this simple. ;-) Well, at least scripts/checkpatch.pl now complains if you try to add a new comment-free barrier. Not that these complaints are always paid attention to... Thanx, Paul
Re: Linux-kernel examples for LKMM recipes
On Tue, 17 Oct 2017, Paul E. McKenney wrote: > How about this? > > 0.Simple special cases > > If there is only one CPU on the one hand or only one variable > on the other, the code will execute in order. There are (as > usual) some things to be careful of: > > a. There are some aspects of the C language that are > unordered. For example, the compiler can output code > computing arguments of a multi-parameter function in > any order it likes, or even interleaved if it so chooses. That parses a little oddly. I wouldn't agree that the compiler outputs the code in any order it likes! In fact, I wouldn't even mention the compiler at all. Just say that (with a few exceptions) the language doesn't specify the order in which the arguments of a function or operation should be evaluated. For example, in the expression "f(x) + g(y)", the order in which f and g are called is not defined; the object code is allowed to use either order or even to interleave the computations. > b. Compilers are permitted to use the "as-if" rule. > That is, a compiler can emit whatever code it likes, > as long as the results appear just as if the compiler > had followed all the relevant rules. To see this, > compiler with a high level of optimization and run > the debugger on the resulting binary. You might omit the last sentence. Furthermore, if the accesses don't use READ_ONCE/WRITE_ONCE then the code might not get the same result as if it had executed in order (even for a single variable!), and if you do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code it likes. > c. If there is only one variable but multiple CPUs, all > accesses to that variable must be aligned and full sized. I would say that the variable is what needs to be aligned, not the accesses. (Although, if the variable is aligned and all the accesses are full sized, then they must necessarily be aligned as well.) 
> Variables that straddle cachelines or pages void your > full-ordering warranty, as do undersized accesses that > load from or store to only part of the variable. How can a variable straddle pages without also straddling cache lines? Alan
Re: Linux-kernel examples for LKMM recipes
On Fri, Oct 13, 2017 at 04:09:26PM -0400, Alan Stern wrote: > On Fri, 13 Oct 2017, Paul E. McKenney wrote: > > On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote: > > > On Wed, 11 Oct 2017, Paul E. McKenney wrote: [ . . . ] > > Perhaps the recipes document should just baldly state that any execution > > having only one thread and/or having only one variable will be fully > > ordered? > > That wouldn't be a bad idea. (Although the part about only one thread > should be pretty obvious.) > > Also, you have to be a little careful because the ordering of the > execution may not always agree with the ordering of the source code. > The compiler is allowed to evaluate the arguments to a function call, > for example, in any order it likes. How about this? 0. Simple special cases If there is only one CPU on the one hand or only one variable on the other, the code will execute in order. There are (as usual) some things to be careful of: a. There are some aspects of the C language that are unordered. For example, the compiler can output code computing arguments of a multi-parameter function in any order it likes, or even interleaved if it so chooses. b. Compilers are permitted to use the "as-if" rule. That is, a compiler can emit whatever code it likes, as long as the results appear just as if the compiler had followed all the relevant rules. To see this, compile with a high level of optimization and run the debugger on the resulting binary. c. If there is only one variable but multiple CPUs, all accesses to that variable must be aligned and full sized. Variables that straddle cachelines or pages void your full-ordering warranty, as do undersized accesses that load from or store to only part of the variable. Thanx, Paul
Re: Linux-kernel examples for LKMM recipes
On Fri, 13 Oct 2017, Paul E. McKenney wrote: > On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote: > > On Wed, 11 Oct 2017, Paul E. McKenney wrote: > > > > > This document lists the litmus-test patterns that we have been discussing, > > > along with examples from the Linux kernel. This is intended to feed into > > > the recipes document. All examples are from v4.13. > > > > > > 0.Single-variable SC. > > > > > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > > > (see kernel/rcu/tree.c). The counter is accessed by > > > interrupts and NMIs as well as by process-level code. > > > This counter can be accessed by other CPUs, but only > > > for debug output. > > > > I'm not sure that single-variable SC can really be represented by an > > example. It gets used literally all over the kernel -- it's such a > > large part of the way we think about computer programs that we rely on > > it unconsciously. > > > > For example, the very first function in the very first C source file > > in the kernel/ directory (namely, check_free_space() in kernel/acct.c) > > includes this code: > > > > if (acct->active) { > > u64 suspend = sbuf.f_blocks * SUSPEND; > > do_div(suspend, 100); > > > > How do we know that the value which gets divided by 100 is > > sbuf.f_blocks * SUSPEND and not the random garbage which was stored in > > suspend's memory location before it was initialized? Answer: > > per-variable SC. > > > > Okay, maybe that's not really applicable, since it doesn't involve > > accesses to shared memory. Here's an example that does. > > get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice. > > How do we know that the value retrieved by the second call was not > > stored _earlier_ than the value retrieved by the first call? > > Per-variable SC. > > > > > b. Between CPUs, I would put forward the ->dflags > > > updates, but this is anything but simple. 
But maybe > > > OK for an illustration? > > > > Pretty much any code that accesses the same shared variable twice on > > the same CPU could be an example of per-variable SC. But I don't think > > people would learn much by studying such examples. > > Perhaps the recipes document should just baldly state that any execution > having only one thread and/or having only one variable will be fully > ordered? That wouldn't be a bad idea. (Although the part about only one thread should be pretty obvious.) Also, you have to be a little careful because the ordering of the execution may not always agree with the ordering of the source code. The compiler is allowed to evaluate the arguments to a function call, for example, in any order it likes. Alan
Re: Linux-kernel examples for LKMM recipes
On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote: > On Wed, 11 Oct 2017, Paul E. McKenney wrote: > > > This document lists the litmus-test patterns that we have been discussing, > > along with examples from the Linux kernel. This is intended to feed into > > the recipes document. All examples are from v4.13. > > > > 0. Single-variable SC. > > > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > > (see kernel/rcu/tree.c). The counter is accessed by > > interrupts and NMIs as well as by process-level code. > > This counter can be accessed by other CPUs, but only > > for debug output. > > I'm not sure that single-variable SC can really be represented by an > example. It gets used literally all over the kernel -- it's such a > large part of the way we think about computer programs that we rely on > it unconsciously. > > For example, the very first function in the very first C source file > in the kernel/ directory (namely, check_free_space() in kernel/acct.c) > includes this code: > > if (acct->active) { > u64 suspend = sbuf.f_blocks * SUSPEND; > do_div(suspend, 100); > > How do we know that the value which gets divided by 100 is > sbuf.f_blocks * SUSPEND and not the random garbage which was stored in > suspend's memory location before it was initialized? Answer: > per-variable SC. > > Okay, maybe that's not really applicable, since it doesn't involve > accesses to shared memory. Here's an example that does. > get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice. > How do we know that the value retrieved by the second call was not > stored _earlier_ than the value retrieved by the first call? > Per-variable SC. > > > b. Between CPUs, I would put forward the ->dflags > > updates, but this is anything but simple. But maybe > > OK for an illustration? 
> > Pretty much any code that accesses the same shared variable twice on > the same CPU could be an example of per-variable SC. But I don't think > people would learn much by studying such examples. Perhaps the recipes document should just baldly state that any execution having only one thread and/or having only one variable will be fully ordered? Thanx, Paul
Re: Linux-kernel examples for LKMM recipes
On Wed, 11 Oct 2017, Paul E. McKenney wrote: > This document lists the litmus-test patterns that we have been discussing, > along with examples from the Linux kernel. This is intended to feed into > the recipes document. All examples are from v4.13. > > 0.Single-variable SC. > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > (see kernel/rcu/tree.c). The counter is accessed by > interrupts and NMIs as well as by process-level code. > This counter can be accessed by other CPUs, but only > for debug output. I'm not sure that single-variable SC can really be represented by an example. It gets used literally all over the kernel -- it's such a large part of the way we think about computer programs that we rely on it unconsciously. For example, the very first function in the very first C source file in the kernel/ directory (namely, check_free_space() in kernel/acct.c) includes this code: if (acct->active) { u64 suspend = sbuf.f_blocks * SUSPEND; do_div(suspend, 100); How do we know that the value which gets divided by 100 is sbuf.f_blocks * SUSPEND and not the random garbage which was stored in suspend's memory location before it was initialized? Answer: per-variable SC. Okay, maybe that's not really applicable, since it doesn't involve accesses to shared memory. Here's an example that does. get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice. How do we know that the value retrieved by the second call was not stored _earlier_ than the value retrieved by the first call? Per-variable SC. > b. Between CPUs, I would put forward the ->dflags > updates, but this is anything but simple. But maybe > OK for an illustration? Pretty much any code that accesses the same shared variable twice on the same CPU could be an example of per-variable SC. But I don't think people would learn much by studying such examples. Alan
Re: Linux-kernel examples for LKMM recipes
Hi Paul, On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote: > Hello! > > At Linux Plumbers Conference, we got requests for a recipes document, > and a further request to point to actual code in the Linux kernel. > I have pulled together some examples for various litmus-test families, > as shown below. The decoder ring for the abbreviations (ISA2, LB, SB, > MP, ...) is here: > > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf > > This document is also checked into the memory-models git archive: > > https://github.com/aparri/memory-model.git > > I would be especially interested in simpler examples in general, and > of course any example at all for the cases where I was unable to find > any. Thoughts? Below are some examples we did discuss (at some point): The comment in kernel/events/ring_buffer.c:perf_output_put_handle() describes instances of MP+wmb+rmb and LB+ctrl+mb. The comments in kernel/sched/core.c:try_to_wake_up() describes more instances of MP ("plus locking") and LB (see finish_lock_switch()). The comment in kernel/sched/core.c:task_rq_lock() describes an ins- tance of MP+wmb+addr-acqpo. The comment in include/linux/wait.h:waitqueue_active() describes an instance of SB+mb+mb. 63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug") describes an instance of W+RWC+porel+mb+mb. [...] I wish we could say "any barrier (explicit or implicit) in sources is accompanied by a comment mentioning the interested pattern...", but life is not always this simple. ;-) Andrea > > Thanx, Paul > > > > This document lists the litmus-test patterns that we have been discussing, > along with examples from the Linux kernel. This is intended to feed into > the recipes document. All examples are from v4.13. > > 0.Single-variable SC. > > a. Within a single CPU, the use of the ->dynticks_nmi_nesting > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies > (see kernel/rcu/tree.c). 
The counter is accessed by > interrupts and NMIs as well as by process-level code. > This counter can be accessed by other CPUs, but only > for debug output. > > b. Between CPUs, I would put forward the ->dflags > updates, but this is anything but simple. But maybe > OK for an illustration? > > 1.MP (see test6.pdf for nickname translation) > > a. smp_store_release() / smp_load_acquire() > > init_stack_slab() in lib/stackdepot.c uses release-acquire > to handle initialization of a slab of the stack. Working > out the mutual-exclusion design is left as an exercise for > the reader. > > b. rcu_assign_pointer() / rcu_dereference() > > expand_to_next_prime() does the rcu_assign_pointer(), > and next_prime_number() does the rcu_dereference(). > This mediates access to a bit vector that is expanded > as additional primes are needed. These two functions > are in lib/prime_numbers.c. > > c. smp_wmb() / smp_rmb() > > xlog_state_switch_iclogs() contains the following: > > log->l_curr_block -= log->l_logBBsize; > ASSERT(log->l_curr_block >= 0); > smp_wmb(); > log->l_curr_cycle++; > > And xlog_valid_lsn() contains the following: > > cur_cycle = ACCESS_ONCE(log->l_curr_cycle); > smp_rmb(); > cur_block = ACCESS_ONCE(log->l_curr_block); > > d. Replacing either of the above with smp_mb() > > Holding off on this one for the moment... > > 2.Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB > > Lots of variety here, can in some cases substitute: > > a. READ_ONCE() for smp_load_acquire() > b. WRITE_ONCE() for smp_store_release() > c. Dependencies for both smp_load_acquire() and > smp_store_release(). > d. smp_wmb() for smp_store_release() in first thread > of ISA2 and Z6.2. > e. smp_rmb() for smp_load_acquire() in last thread of ISA2. > > The canonical illustration of LB involves the various memory > allocators, where you don't want a load from about-to-be-freed > memory to see a store initializing a later incarnation of that > same memory area. 
But the per-CPU caches make this a very > long and complicated example. > > I am not aware of any three-CPU release-acquire chains in the > Linux kernel. There are three-CPU lock-based chains in RCU, > but these are not at all simple, either. > > Thoughts? > > 3.
Re: Linux-kernel examples for LKMM recipes
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote: > On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote: > > I am not aware of any three-CPU release-acquire chains in the > > Linux kernel. There are three-CPU lock-based chains in RCU, > > but these are not at all simple, either. > > > > The "Program-Order guarantees" case in scheduler? See the comments > written by Peter above try_to_wake_up(): > > * The basic program-order guarantee on SMP systems is that when a task [t] > * migrates, all its activity on its old CPU [c0] happens-before any > subsequent > * execution on its new CPU [c1]. > ... > * For blocking we (obviously) need to provide the same guarantee as for > * migration. However the means are completely different as there is no lock > * chain to provide order. Instead we do: > * > * 1) smp_store_release(X->on_cpu, 0) > * 2) smp_cond_load_acquire(!X->on_cpu) > * > * Example: > * > * CPU0 (schedule) CPU1 (try_to_wake_up) CPU2 (schedule) > * > * LOCK rq(0)->lock LOCK X->pi_lock > * dequeue X > * sched-out X > * smp_store_release(X->on_cpu, 0); > * > * smp_cond_load_acquire(&X->on_cpu, !VAL); > * X->state = WAKING > * set_task_cpu(X,2) > * > * LOCK rq(2)->lock > * enqueue X > * X->state = RUNNING > * UNLOCK rq(2)->lock > * > * LOCK rq(2)->lock // orders > against CPU1 > * sched-out Z > * sched-in X > * UNLOCK rq(2)->lock > * > * UNLOCK X->pi_lock > * UNLOCK rq(0)->lock > > This is a chain mixed with lock and acquire-release (maybe even better?). > > > And another example would be osq_{lock,unlock}() on multiple (more than > three) CPUs. I think the qrwlock also has something similar with the writer fairness issue fixed: CPU0: (writer doing an unlock) smp_store_release(&lock->wlocked, 0); // Bottom byte of lock->cnts CPU1: (waiting writer on slowpath) atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING); ... arch_spin_unlock(&lock->wait_lock); CPU2: (reader on slowpath) arch_spin_lock(&lock->wait_lock); and there's mixed-size accesses here too. Fun stuff! Will
Re: Linux-kernel examples for LKMM recipes
On Wed, Oct 11, 2017 at 10:32:30PM +, Paul E. McKenney wrote:
> Hello!
>
> At Linux Plumbers Conference, we got requests for a recipes document,
> and a further request to point to actual code in the Linux kernel.
> I have pulled together some examples for various litmus-test families,
> as shown below.  The decoder ring for the abbreviations (ISA2, LB, SB,
> MP, ...) is here:
>
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
>
> This document is also checked into the memory-models git archive:
>
> https://github.com/aparri/memory-model.git
>
> I would be especially interested in simpler examples in general, and
> of course any example at all for the cases where I was unable to find
> any.  Thoughts?
>
> 							Thanx, Paul
>
>
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
>
> 0.	Single-variable SC.
>
> 	a.	Within a single CPU, the use of the ->dynticks_nmi_nesting
> 		counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> 		(see kernel/rcu/tree.c).  The counter is accessed by
> 		interrupts and NMIs as well as by process-level code.
> 		This counter can be accessed by other CPUs, but only
> 		for debug output.
>
> 	b.	Between CPUs, I would put forward the ->dflags
> 		updates, but this is anything but simple.  But maybe
> 		OK for an illustration?
>
> 1.	MP (see test6.pdf for nickname translation)
>
> 	a.	smp_store_release() / smp_load_acquire()
>
> 		init_stack_slab() in lib/stackdepot.c uses release-acquire
> 		to handle initialization of a slab of the stack.  Working
> 		out the mutual-exclusion design is left as an exercise for
> 		the reader.
>
> 	b.	rcu_assign_pointer() / rcu_dereference()
>
> 		expand_to_next_prime() does the rcu_assign_pointer(),
> 		and next_prime_number() does the rcu_dereference().
> 		This mediates access to a bit vector that is expanded
> 		as additional primes are needed.  These two functions
> 		are in lib/prime_numbers.c.
>
> 	c.	smp_wmb() / smp_rmb()
>
> 		xlog_state_switch_iclogs() contains the following:
>
> 			log->l_curr_block -= log->l_logBBsize;
> 			ASSERT(log->l_curr_block >= 0);
> 			smp_wmb();
> 			log->l_curr_cycle++;
>
> 		And xlog_valid_lsn() contains the following:
>
> 			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> 			smp_rmb();
> 			cur_block = ACCESS_ONCE(log->l_curr_block);
>
> 	d.	Replacing either of the above with smp_mb()
>
> 		Holding off on this one for the moment...
>
> 2.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
>
> 	Lots of variety here, can in some cases substitute:
>
> 	a.	READ_ONCE() for smp_load_acquire()
> 	b.	WRITE_ONCE() for smp_store_release()
> 	c.	Dependencies for both smp_load_acquire() and
> 		smp_store_release().
> 	d.	smp_wmb() for smp_store_release() in first thread
> 		of ISA2 and Z6.2.
> 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
>
> 	The canonical illustration of LB involves the various memory
> 	allocators, where you don't want a load from about-to-be-freed
> 	memory to see a store initializing a later incarnation of that
> 	same memory area.  But the per-CPU caches make this a very
> 	long and complicated example.
>
> 	I am not aware of any three-CPU release-acquire chains in the
> 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> 	but these are not at all simple, either.
>

The "Program-Order guarantees" case in scheduler?  See the comments
written by Peter above try_to_wake_up():

 * The basic program-order guarantee on SMP systems is that when a task [t]
 * migrates, all its activity on its old CPU [c0] happens-before any subsequent
 * execution on its new CPU [c1].

...

 * For blocking we (obviously) need to provide the same guarantee as for
 * migration. However the means are completely different as there is no lock
 * chain to provide order. Instead we do:
 *
 *   1) smp_store_release(X->on_cpu, 0)
 *   2) smp_cond_load_acquire(!X->on_cpu)
 *
 * Example:
 *
 *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
 *
 *   LOCK rq(0)->lock LOCK X->pi_lock
 *   dequeue X
 *   sched-out X
 *   smp_store_release(X->on_cpu, 0);
 *
 *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
 *