Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, Oct 04, 2017 at 05:24:23PM +0200, Peter Zijlstra wrote: > On Wed, Oct 04, 2017 at 07:17:45AM -0700, Paul E. McKenney wrote: > > If you use READ_ONCE(), then all architectures I know of enforce > > full ordering for accesses to a single variable. (If you don't use > > READ_ONCE(), then in theory Itanium can reorder reads.) Me, I would > > argue for WRITE_ONCE() as well to prevent store tearing. > > Note that the stores are either cmpxchg() or smp_store_release() both of > which imply a WRITE_ONCE(). That works for me! ;-) Thanx, Paul
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, Oct 04, 2017 at 07:17:45AM -0700, Paul E. McKenney wrote: > If you use READ_ONCE(), then all architectures I know of enforce > full ordering for accesses to a single variable. (If you don't use > READ_ONCE(), then in theory Itanium can reorder reads.) Me, I would > argue for WRITE_ONCE() as well to prevent store tearing. Note that the stores are either cmpxchg() or smp_store_release() both of which imply a WRITE_ONCE().
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, Oct 04, 2017 at 04:52:47PM +0200, Peter Zijlstra wrote: > On Wed, Oct 04, 2017 at 10:43:54AM -0400, Steven Rostedt wrote: > > > > My question is not about ordering, but about coherency. Can you have > > one CPU read a variable that goes into cache, and keep using the cached > > variable every time the program asks to read it, instead of going out > > to memory. > > No, not on a coherent system. What Peter said. And if you use READ_ONCE() for the reads, than as far as I know, all the systems that the Linux kernel supports are coherent systems. Thanx, Paul
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, 4 Oct 2017 16:52:47 +0200 Peter Zijlstra wrote: > On Wed, Oct 04, 2017 at 10:43:54AM -0400, Steven Rostedt wrote: > > > > My question is not about ordering, but about coherency. Can you have > > one CPU read a variable that goes into cache, and keep using the cached > > variable every time the program asks to read it, instead of going out > > to memory. > > No, not on a coherent system. In this case. Reviewed-by: Steven Rostedt (VMware) -- Steve
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, Oct 04, 2017 at 10:43:54AM -0400, Steven Rostedt wrote: > > My question is not about ordering, but about coherency. Can you have > one CPU read a variable that goes into cache, and keep using the cached > variable every time the program asks to read it, instead of going out > to memory. No, not on a coherent system.
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, 4 Oct 2017 07:17:45 -0700 "Paul E. McKenney" wrote: > > I'm more worried about other architectures that don't have as strong of > > a cache coherency. > > > > [ Added Paul as he knows a lot about odd architectures ] > > > > Is there any architecture that we support that can have the following: > > > > CPU0CPU1 > > > > early_printk_cpu = 1 > > for (;;) > >old = READ_ONCE(early_printk_cpu); > >[ old = 1 ] > > > > early_printk_cpu = -1 > > > >[...] > >cpu_relax(); > >old = READ_ONCE(early_printk_cpu); > > > >[ but the CPU uses the cache and not the memory? ] > > > >old = 1; > > If you use READ_ONCE(), then all architectures I know of enforce > full ordering for accesses to a single variable. (If you don't use > READ_ONCE(), then in theory Itanium can reorder reads.) Me, I would > argue for WRITE_ONCE() as well to prevent store tearing. > > It is only when you have at least two variables and at least two threads > than things start getting really "interesting". ;-) > My question is not about ordering, but about coherency. Can you have one CPU read a variable that goes into cache, and keep using the cached variable every time the program asks to read it, instead of going out to memory. Also, on the other CPU, if a variable is written, and the cache is not write-through, could that variable be sitting in cache and not go out to memory until a flush happens? Do we support architectures that don't have strong coherency to know that one CPU is asking for a memory location for something that was changed in the cache of another CPU. Or a CPU will return old cache data even though the memory was updated? I guess my concern is that READ_ONCE() and cpu_relax(), don't actually do a memory barrier. They are mostly compiler barriers. Do we support poorly coherent architectures that require some kind of flush to make sure the communication exists between the two CPUs? Note, at TimeSys, I had to port Linux to a poorly coherent SMP board that would trip over this all the time. I don't know if that board is sufficient to run Linux, as we had to slap in memory barriers all over the place. But this was a 2.4 kernel at the time. -- Steve
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, Oct 04, 2017 at 09:04:01AM -0400, Steven Rostedt wrote: > On Wed, 4 Oct 2017 11:08:30 +0200 > Peter Zijlstra wrote: > > > On Tue, Oct 03, 2017 at 06:24:22PM -0400, Steven Rostedt wrote: > > > On Thu, 28 Sep 2017 14:18:26 +0200 > > > Peter Zijlstra wrote: > > > > static int early_vprintk(const char *fmt, va_list args) > > > > { > > > > + int n, cpu, old; > > > > char buf[512]; > > > > + > > > > + cpu = get_cpu(); > > > > + /* > > > > +* Test-and-Set inter-cpu spinlock with recursion. > > > > +*/ > > > > + for (;;) { > > > > + /* > > > > +* c-cas to avoid the exclusive bouncing on spin. > > > > +* Depends on the memory barrier implied by cmpxchg > > > > +* for ACQUIRE semantics. > > > > +*/ > > > > + old = READ_ONCE(early_printk_cpu); > > > > + if (old == -1) { > > > > > > If old != -1 and old != cpu, is it possible that the CPU could have > > > fetched an old value, and never try to fetch it again? > > > > What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the > > READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again, > > as does the barrier() implied by cpu_relax(). > > I'm more worried about other architectures that don't have as strong of > a cache coherency. > > [ Added Paul as he knows a lot about odd architectures ] > > Is there any architecture that we support that can have the following: > > CPU0CPU1 > > early_printk_cpu = 1 > for (;;) >old = READ_ONCE(early_printk_cpu); >[ old = 1 ] > > early_printk_cpu = -1 > >[...] >cpu_relax(); >old = READ_ONCE(early_printk_cpu); > >[ but the CPU uses the cache and not the memory? ] > >old = 1; If you use READ_ONCE(), then all architectures I know of enforce full ordering for accesses to a single variable. (If you don't use READ_ONCE(), then in theory Itanium can reorder reads.) Me, I would argue for WRITE_ONCE() as well to prevent store tearing. It is only when you have at least two variables and at least two threads than things start getting really "interesting". ;-) Thanx, Paul > -- Steve > > > > > > > The cmpxchg memory barrier only happens when old == -1. > > > > Yeah, so? > > > > > > + old = cmpxchg(&early_printk_cpu, -1, cpu); > > > > + if (old == -1) > > > > + break; > > > > + } > > > > + /* > > > > +* Allow recursion for interrupts and the like. > > > > +*/ > > > > + if (old == cpu) > > > > + break; > > > > + > > > > + cpu_relax(); > > > > + } > > > > > > > > n = vscnprintf(buf, sizeof(buf), fmt, args); > > > > early_console->write(early_console, buf, n); > > > > > > > > + /* > > > > +* Unlock -- in case @old == @cpu, this is a no-op. > > > > +*/ > > > > + smp_store_release(&early_printk_cpu, old); > > > > + put_cpu(); > > > > + > > > > return n; > > > > } >
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, Oct 04, 2017 at 09:04:01AM -0400, Steven Rostedt wrote: > > > If old != -1 and old != cpu, is it possible that the CPU could have > > > fetched an old value, and never try to fetch it again? > > > > What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the > > READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again, > > as does the barrier() implied by cpu_relax(). > > I'm more worried about other architectures that don't have as strong of > a cache coherency. Linux mandates cache-coherency, there's no weak or strong there. Memory ordering can be weak or strong, but coherency not. If this patch is broken, lots of code would be broken.
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Wed, 4 Oct 2017 11:08:30 +0200 Peter Zijlstra wrote: > On Tue, Oct 03, 2017 at 06:24:22PM -0400, Steven Rostedt wrote: > > On Thu, 28 Sep 2017 14:18:26 +0200 > > Peter Zijlstra wrote: > > > static int early_vprintk(const char *fmt, va_list args) > > > { > > > + int n, cpu, old; > > > char buf[512]; > > > + > > > + cpu = get_cpu(); > > > + /* > > > + * Test-and-Set inter-cpu spinlock with recursion. > > > + */ > > > + for (;;) { > > > + /* > > > + * c-cas to avoid the exclusive bouncing on spin. > > > + * Depends on the memory barrier implied by cmpxchg > > > + * for ACQUIRE semantics. > > > + */ > > > + old = READ_ONCE(early_printk_cpu); > > > + if (old == -1) { > > > > If old != -1 and old != cpu, is it possible that the CPU could have > > fetched an old value, and never try to fetch it again? > > What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the > READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again, > as does the barrier() implied by cpu_relax(). I'm more worried about other architectures that don't have as strong of a cache coherency. [ Added Paul as he knows a lot about odd architectures ] Is there any architecture that we support that can have the following: CPU0CPU1 early_printk_cpu = 1 for (;;) old = READ_ONCE(early_printk_cpu); [ old = 1 ] early_printk_cpu = -1 [...] cpu_relax(); old = READ_ONCE(early_printk_cpu); [ but the CPU uses the cache and not the memory? ] old = 1; -- Steve > > > The cmpxchg memory barrier only happens when old == -1. > > Yeah, so? > > > > + old = cmpxchg(&early_printk_cpu, -1, cpu); > > > + if (old == -1) > > > + break; > > > + } > > > + /* > > > + * Allow recursion for interrupts and the like. > > > + */ > > > + if (old == cpu) > > > + break; > > > + > > > + cpu_relax(); > > > + } > > > > > > n = vscnprintf(buf, sizeof(buf), fmt, args); > > > early_console->write(early_console, buf, n); > > > > > > + /* > > > + * Unlock -- in case @old == @cpu, this is a no-op. > > > + */ > > > + smp_store_release(&early_printk_cpu, old); > > > + put_cpu(); > > > + > > > return n; > > > }
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Tue, Oct 03, 2017 at 06:24:22PM -0400, Steven Rostedt wrote: > On Thu, 28 Sep 2017 14:18:26 +0200 > Peter Zijlstra wrote: > > static int early_vprintk(const char *fmt, va_list args) > > { > > + int n, cpu, old; > > char buf[512]; > > + > > + cpu = get_cpu(); > > + /* > > +* Test-and-Set inter-cpu spinlock with recursion. > > +*/ > > + for (;;) { > > + /* > > +* c-cas to avoid the exclusive bouncing on spin. > > +* Depends on the memory barrier implied by cmpxchg > > +* for ACQUIRE semantics. > > +*/ > > + old = READ_ONCE(early_printk_cpu); > > + if (old == -1) { > > If old != -1 and old != cpu, is it possible that the CPU could have > fetched an old value, and never try to fetch it again? What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again, as does the barrier() implied by cpu_relax(). > The cmpxchg memory barrier only happens when old == -1. Yeah, so? > > + old = cmpxchg(&early_printk_cpu, -1, cpu); > > + if (old == -1) > > + break; > > + } > > + /* > > +* Allow recursion for interrupts and the like. > > +*/ > > + if (old == cpu) > > + break; > > + > > + cpu_relax(); > > + } > > > > n = vscnprintf(buf, sizeof(buf), fmt, args); > > early_console->write(early_console, buf, n); > > > > + /* > > +* Unlock -- in case @old == @cpu, this is a no-op. > > +*/ > > + smp_store_release(&early_printk_cpu, old); > > + put_cpu(); > > + > > return n; > > }
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Thu, 28 Sep 2017 14:18:26 +0200 Peter Zijlstra wrote: > In order to avoid multiple CPUs banging on the serial port at the same > time, add simple serialization. This explicitly deals with nested > contexts (like IRQs etc.). > > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/printk/printk.c | 35 ++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -378,14 +378,47 @@ static int __init force_early_printk_set > } > early_param("force_early_printk", force_early_printk_setup); > > +static int early_printk_cpu = -1; > + > static int early_vprintk(const char *fmt, va_list args) > { > + int n, cpu, old; > char buf[512]; > - int n; > + > + cpu = get_cpu(); > + /* > + * Test-and-Set inter-cpu spinlock with recursion. > + */ > + for (;;) { > + /* > + * c-cas to avoid the exclusive bouncing on spin. > + * Depends on the memory barrier implied by cmpxchg > + * for ACQUIRE semantics. > + */ > + old = READ_ONCE(early_printk_cpu); > + if (old == -1) { If old != -1 and old != cpu, is it possible that the CPU could have fetched an old value, and never try to fetch it again? The cmpxchg memory barrier only happens when old == -1. -- Steve > + old = cmpxchg(&early_printk_cpu, -1, cpu); > + if (old == -1) > + break; > + } > + /* > + * Allow recursion for interrupts and the like. > + */ > + if (old == cpu) > + break; > + > + cpu_relax(); > + } > > n = vscnprintf(buf, sizeof(buf), fmt, args); > early_console->write(early_console, buf, n); > > + /* > + * Unlock -- in case @old == @cpu, this is a no-op. > + */ > + smp_store_release(&early_printk_cpu, old); > + put_cpu(); > + > return n; > } > >
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Tue 2016-10-18 19:08:33, Peter Zijlstra wrote: > In order to avoid multiple CPUs banging on the serial port at the same > time, add simple serialization. This explicitly deals with nested > contexts (like IRQs etc.). > > Signed-off-by: Peter Zijlstra (Intel) Makes sense. Just a small comment below. Reviewd-by: Petr Mladek > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -356,14 +356,28 @@ static int __init force_early_printk_set > } > early_param("force_early_printk", force_early_printk_setup); > > +static int early_printk_cpu = -1; [...] > + for (;;) { > + old = cmpxchg(&early_printk_cpu, -1, cpu); > + if (old == -1 || old == cpu) > + break; > + > + cpu_relax(); > + } > > n = vscnprintf(buf, sizeof(buf), fmt, args); > early_console->write(early_console, buf, n); > > + smp_store_release(&early_printk_cpu, old); checkpatch.pl complains about using a barrier without a comment. The code is simple but it still might help to add something like: /* Releasing early_printk_cpu custom lock. */ Best Regards, Petr
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Tue, 18 Oct 2016 19:30:46 +0200 Peter Zijlstra wrote: > On Tue, Oct 18, 2016 at 01:19:51PM -0400, Steven Rostedt wrote: > Looking at this again, we really should not spin using cmpxchg(), that > thrashes the cacheline. > > The below is slightly better spinning... then again, this isn't > performance code. Yeah, this is similar to the hack I had. May want to fold this in your previous version. -- Steve > > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -376,10 +376,16 @@ static int early_vprintk(const char *fmt > > cpu = get_cpu(); > for (;;) { > - old = cmpxchg(&early_printk_cpu, -1, cpu); > - if (old == -1 || old == cpu) > + old = READ_ONCE(early_printk_cpu); > + if (old == cpu) > break; > > + if (old == -1) { > + old = cmpxchg(&early_printk_cpu, -1, cpu); > + if (old == -1 || old == cpu) > + break; > + } > + > cpu_relax(); > } >
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Tue, Oct 18, 2016 at 01:19:51PM -0400, Steven Rostedt wrote: > > hehe, I have the exact same patch for my -rt work. > > +static int early_printk_cpu = -1; > > + > > static int early_vprintk(const char *fmt, va_list args) > > { > > + int n, cpu, old; > > char buf[512]; > > + > > + cpu = get_cpu(); > > + for (;;) { > > + old = cmpxchg(&early_printk_cpu, -1, cpu); > > + if (old == -1 || old == cpu) > > + break; Looking at this again, we really should not spin using cmpxchg(), that thrashes the cacheline. The below is slightly better spinning... then again, this isn't performance code. --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -376,10 +376,16 @@ static int early_vprintk(const char *fmt cpu = get_cpu(); for (;;) { - old = cmpxchg(&early_printk_cpu, -1, cpu); - if (old == -1 || old == cpu) + old = READ_ONCE(early_printk_cpu); + if (old == cpu) break; + if (old == -1) { + old = cmpxchg(&early_printk_cpu, -1, cpu); + if (old == -1 || old == cpu) + break; + } + cpu_relax(); }
Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()
On Tue, 18 Oct 2016 19:08:33 +0200 Peter Zijlstra wrote: > In order to avoid multiple CPUs banging on the serial port at the same > time, add simple serialization. This explicitly deals with nested > contexts (like IRQs etc.). > > Signed-off-by: Peter Zijlstra (Intel) > --- hehe, I have the exact same patch for my -rt work. -- Steve > kernel/printk/printk.c | 16 +++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -356,14 +356,28 @@ static int __init force_early_printk_set > } > early_param("force_early_printk", force_early_printk_setup); > > +static int early_printk_cpu = -1; > + > static int early_vprintk(const char *fmt, va_list args) > { > + int n, cpu, old; > char buf[512]; > - int n; > + > + cpu = get_cpu(); > + for (;;) { > + old = cmpxchg(&early_printk_cpu, -1, cpu); > + if (old == -1 || old == cpu) > + break; > + > + cpu_relax(); > + } > > n = vscnprintf(buf, sizeof(buf), fmt, args); > early_console->write(early_console, buf, n); > > + smp_store_release(&early_printk_cpu, old); > + put_cpu(); > + > return n; > } > >