Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Paul E. McKenney
On Wed, Oct 04, 2017 at 05:24:23PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 07:17:45AM -0700, Paul E. McKenney wrote:
> > If you use READ_ONCE(), then all architectures I know of enforce
> > full ordering for accesses to a single variable.  (If you don't use
> > READ_ONCE(), then in theory Itanium can reorder reads.)  Me, I would
> > argue for WRITE_ONCE() as well to prevent store tearing.
> 
> Note that the stores are either cmpxchg() or smp_store_release() both of
> which imply a WRITE_ONCE().

That works for me!  ;-)

Thanx, Paul



Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Peter Zijlstra
On Wed, Oct 04, 2017 at 07:17:45AM -0700, Paul E. McKenney wrote:
> If you use READ_ONCE(), then all architectures I know of enforce
> full ordering for accesses to a single variable.  (If you don't use
> READ_ONCE(), then in theory Itanium can reorder reads.)  Me, I would
> argue for WRITE_ONCE() as well to prevent store tearing.

Note that the stores are either cmpxchg() or smp_store_release() both of
which imply a WRITE_ONCE().


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Paul E. McKenney
On Wed, Oct 04, 2017 at 04:52:47PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 10:43:54AM -0400, Steven Rostedt wrote:
> > 
> > My question is not about ordering, but about coherency. Can you have
> > one CPU read a variable that goes into cache, and keep using the cached
> > variable every time the program asks to read it, instead of going out
> > to memory.
> 
> No, not on a coherent system.

What Peter said.  And if you use READ_ONCE() for the reads, than as
far as I know, all the systems that the Linux kernel supports are
coherent systems.

Thanx, Paul



Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2017 16:52:47 +0200
Peter Zijlstra  wrote:

> On Wed, Oct 04, 2017 at 10:43:54AM -0400, Steven Rostedt wrote:
> > 
> > My question is not about ordering, but about coherency. Can you have
> > one CPU read a variable that goes into cache, and keep using the cached
> > variable every time the program asks to read it, instead of going out
> > to memory.  
> 
> No, not on a coherent system.

In this case.

Reviewed-by: Steven Rostedt (VMware) 

-- Steve


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Peter Zijlstra
On Wed, Oct 04, 2017 at 10:43:54AM -0400, Steven Rostedt wrote:
> 
> My question is not about ordering, but about coherency. Can you have
> one CPU read a variable that goes into cache, and keep using the cached
> variable every time the program asks to read it, instead of going out
> to memory.

No, not on a coherent system.


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2017 07:17:45 -0700
"Paul E. McKenney"  wrote:

> > I'm more worried about other architectures that don't have as strong of
> > a cache coherency.
> > 
> > [ Added Paul as he knows a lot about odd architectures ]
> > 
> > Is there any architecture that we support that can have the following:
> > 
> > CPU0CPU1
> > 
> > early_printk_cpu = 1
> >  for (;;)
> >old = READ_ONCE(early_printk_cpu);
> >[ old = 1 ]
> > 
> > early_printk_cpu = -1
> > 
> >[...]
> >cpu_relax();
> >old = READ_ONCE(early_printk_cpu);
> > 
> >[ but the CPU uses the cache and not the memory? ]
> > 
> >old = 1;  
> 
> If you use READ_ONCE(), then all architectures I know of enforce
> full ordering for accesses to a single variable.  (If you don't use
> READ_ONCE(), then in theory Itanium can reorder reads.)  Me, I would
> argue for WRITE_ONCE() as well to prevent store tearing.
> 
> It is only when you have at least two variables and at least two threads
> than things start getting really "interesting".  ;-)
> 

My question is not about ordering, but about coherency. Can you have
one CPU read a variable that goes into cache, and keep using the cached
variable every time the program asks to read it, instead of going out
to memory.

Also, on the other CPU, if a variable is written, and the cache is not
write-through, could that variable be sitting in cache and not go out
to memory until a flush happens?

Do we support architectures that don't have strong coherency to know
that one CPU is asking for a memory location for something that was
changed in the cache of another CPU. Or a CPU will return old cache data
even though the memory was updated?

I guess my concern is that READ_ONCE() and cpu_relax(), don't actually
do a memory barrier. They are mostly compiler barriers. Do we support
poorly coherent architectures that require some kind of flush to make
sure the communication exists between the two CPUs?

Note, at TimeSys, I had to port Linux to a poorly coherent SMP board
that would trip over this all the time. I don't know if that board is
sufficient to run Linux, as we had to slap in memory barriers all over
the place. But this was a 2.4 kernel at the time.

-- Steve


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Paul E. McKenney
On Wed, Oct 04, 2017 at 09:04:01AM -0400, Steven Rostedt wrote:
> On Wed, 4 Oct 2017 11:08:30 +0200
> Peter Zijlstra  wrote:
> 
> > On Tue, Oct 03, 2017 at 06:24:22PM -0400, Steven Rostedt wrote:
> > > On Thu, 28 Sep 2017 14:18:26 +0200
> > > Peter Zijlstra  wrote:  
> > > >  static int early_vprintk(const char *fmt, va_list args)
> > > >  {
> > > > +   int n, cpu, old;
> > > > char buf[512];
> > > > +
> > > > +   cpu = get_cpu();
> > > > +   /*
> > > > +* Test-and-Set inter-cpu spinlock with recursion.
> > > > +*/
> > > > +   for (;;) {
> > > > +   /*
> > > > +* c-cas to avoid the exclusive bouncing on spin.
> > > > +* Depends on the memory barrier implied by cmpxchg
> > > > +* for ACQUIRE semantics.
> > > > +*/
> > > > +   old = READ_ONCE(early_printk_cpu);
> > > > +   if (old == -1) {  
> > > 
> > > If old != -1 and old != cpu, is it possible that the CPU could have
> > > fetched an old value, and never try to fetch it again?  
> > 
> > What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the
> > READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again,
> > as does the barrier() implied by cpu_relax().
> 
> I'm more worried about other architectures that don't have as strong of
> a cache coherency.
> 
> [ Added Paul as he knows a lot about odd architectures ]
> 
> Is there any architecture that we support that can have the following:
> 
>   CPU0CPU1
>   
>   early_printk_cpu = 1
>  for (;;)
>old = READ_ONCE(early_printk_cpu);
>[ old = 1 ]
> 
>   early_printk_cpu = -1
> 
>[...]
>cpu_relax();
>old = READ_ONCE(early_printk_cpu);
> 
>[ but the CPU uses the cache and not the memory? ]
> 
>old = 1;

If you use READ_ONCE(), then all architectures I know of enforce
full ordering for accesses to a single variable.  (If you don't use
READ_ONCE(), then in theory Itanium can reorder reads.)  Me, I would
argue for WRITE_ONCE() as well to prevent store tearing.

It is only when you have at least two variables and at least two threads
than things start getting really "interesting".  ;-)

Thanx, Paul

> -- Steve
> 
> 
> > 
> > > The cmpxchg memory barrier only happens when old == -1.  
> > 
> > Yeah, so?
> > 
> > > > +   old = cmpxchg(&early_printk_cpu, -1, cpu);
> > > > +   if (old == -1)
> > > > +   break;
> > > > +   }
> > > > +   /*
> > > > +* Allow recursion for interrupts and the like.
> > > > +*/
> > > > +   if (old == cpu)
> > > > +   break;
> > > > +
> > > > +   cpu_relax();
> > > > +   }
> > > >  
> > > > n = vscnprintf(buf, sizeof(buf), fmt, args);
> > > > early_console->write(early_console, buf, n);
> > > >  
> > > > +   /*
> > > > +* Unlock -- in case @old == @cpu, this is a no-op.
> > > > +*/
> > > > +   smp_store_release(&early_printk_cpu, old);
> > > > +   put_cpu();
> > > > +
> > > > return n;
> > > >  }  
> 



Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Peter Zijlstra
On Wed, Oct 04, 2017 at 09:04:01AM -0400, Steven Rostedt wrote:

> > > If old != -1 and old != cpu, is it possible that the CPU could have
> > > fetched an old value, and never try to fetch it again?  
> > 
> > What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the
> > READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again,
> > as does the barrier() implied by cpu_relax().
> 
> I'm more worried about other architectures that don't have as strong of
> a cache coherency.

Linux mandates cache-coherency, there's no weak or strong there. Memory
ordering can be weak or strong, but coherency not.

If this patch is broken, lots of code would be broken.


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2017 11:08:30 +0200
Peter Zijlstra  wrote:

> On Tue, Oct 03, 2017 at 06:24:22PM -0400, Steven Rostedt wrote:
> > On Thu, 28 Sep 2017 14:18:26 +0200
> > Peter Zijlstra  wrote:  
> > >  static int early_vprintk(const char *fmt, va_list args)
> > >  {
> > > + int n, cpu, old;
> > >   char buf[512];
> > > +
> > > + cpu = get_cpu();
> > > + /*
> > > +  * Test-and-Set inter-cpu spinlock with recursion.
> > > +  */
> > > + for (;;) {
> > > + /*
> > > +  * c-cas to avoid the exclusive bouncing on spin.
> > > +  * Depends on the memory barrier implied by cmpxchg
> > > +  * for ACQUIRE semantics.
> > > +  */
> > > + old = READ_ONCE(early_printk_cpu);
> > > + if (old == -1) {  
> > 
> > If old != -1 and old != cpu, is it possible that the CPU could have
> > fetched an old value, and never try to fetch it again?  
> 
> What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the
> READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again,
> as does the barrier() implied by cpu_relax().

I'm more worried about other architectures that don't have as strong of
a cache coherency.

[ Added Paul as he knows a lot about odd architectures ]

Is there any architecture that we support that can have the following:

CPU0CPU1

early_printk_cpu = 1
 for (;;)
   old = READ_ONCE(early_printk_cpu);
   [ old = 1 ]

early_printk_cpu = -1

   [...]
   cpu_relax();
   old = READ_ONCE(early_printk_cpu);

   [ but the CPU uses the cache and not the memory? ]

   old = 1;

-- Steve


> 
> > The cmpxchg memory barrier only happens when old == -1.  
> 
> Yeah, so?
> 
> > > + old = cmpxchg(&early_printk_cpu, -1, cpu);
> > > + if (old == -1)
> > > + break;
> > > + }
> > > + /*
> > > +  * Allow recursion for interrupts and the like.
> > > +  */
> > > + if (old == cpu)
> > > + break;
> > > +
> > > + cpu_relax();
> > > + }
> > >  
> > >   n = vscnprintf(buf, sizeof(buf), fmt, args);
> > >   early_console->write(early_console, buf, n);
> > >  
> > > + /*
> > > +  * Unlock -- in case @old == @cpu, this is a no-op.
> > > +  */
> > > + smp_store_release(&early_printk_cpu, old);
> > > + put_cpu();
> > > +
> > >   return n;
> > >  }  



Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-04 Thread Peter Zijlstra
On Tue, Oct 03, 2017 at 06:24:22PM -0400, Steven Rostedt wrote:
> On Thu, 28 Sep 2017 14:18:26 +0200
> Peter Zijlstra  wrote:
> >  static int early_vprintk(const char *fmt, va_list args)
> >  {
> > +   int n, cpu, old;
> > char buf[512];
> > +
> > +   cpu = get_cpu();
> > +   /*
> > +* Test-and-Set inter-cpu spinlock with recursion.
> > +*/
> > +   for (;;) {
> > +   /*
> > +* c-cas to avoid the exclusive bouncing on spin.
> > +* Depends on the memory barrier implied by cmpxchg
> > +* for ACQUIRE semantics.
> > +*/
> > +   old = READ_ONCE(early_printk_cpu);
> > +   if (old == -1) {
> 
> If old != -1 and old != cpu, is it possible that the CPU could have
> fetched an old value, and never try to fetch it again?

What? If old != -1 and old != cpu, we'll hit the cpu_relax() and do the
READ_ONCE() again. The READ_ONCE() guarantees we'll do the load again,
as does the barrier() implied by cpu_relax().

> The cmpxchg memory barrier only happens when old == -1.

Yeah, so?

> > +   old = cmpxchg(&early_printk_cpu, -1, cpu);
> > +   if (old == -1)
> > +   break;
> > +   }
> > +   /*
> > +* Allow recursion for interrupts and the like.
> > +*/
> > +   if (old == cpu)
> > +   break;
> > +
> > +   cpu_relax();
> > +   }
> >  
> > n = vscnprintf(buf, sizeof(buf), fmt, args);
> > early_console->write(early_console, buf, n);
> >  
> > +   /*
> > +* Unlock -- in case @old == @cpu, this is a no-op.
> > +*/
> > +   smp_store_release(&early_printk_cpu, old);
> > +   put_cpu();
> > +
> > return n;
> >  }


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2017-10-03 Thread Steven Rostedt
On Thu, 28 Sep 2017 14:18:26 +0200
Peter Zijlstra  wrote:

> In order to avoid multiple CPUs banging on the serial port at the same
> time, add simple serialization. This explicitly deals with nested
> contexts (like IRQs etc.).
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/printk/printk.c |   35 ++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -378,14 +378,47 @@ static int __init force_early_printk_set
>  }
>  early_param("force_early_printk", force_early_printk_setup);
>  
> +static int early_printk_cpu = -1;
> +
>  static int early_vprintk(const char *fmt, va_list args)
>  {
> + int n, cpu, old;
>   char buf[512];
> - int n;
> +
> + cpu = get_cpu();
> + /*
> +  * Test-and-Set inter-cpu spinlock with recursion.
> +  */
> + for (;;) {
> + /*
> +  * c-cas to avoid the exclusive bouncing on spin.
> +  * Depends on the memory barrier implied by cmpxchg
> +  * for ACQUIRE semantics.
> +  */
> + old = READ_ONCE(early_printk_cpu);
> + if (old == -1) {

If old != -1 and old != cpu, is it possible that the CPU could have
fetched an old value, and never try to fetch it again?

The cmpxchg memory barrier only happens when old == -1.

-- Steve

> + old = cmpxchg(&early_printk_cpu, -1, cpu);
> + if (old == -1)
> + break;
> + }
> + /*
> +  * Allow recursion for interrupts and the like.
> +  */
> + if (old == cpu)
> + break;
> +
> + cpu_relax();
> + }
>  
>   n = vscnprintf(buf, sizeof(buf), fmt, args);
>   early_console->write(early_console, buf, n);
>  
> + /*
> +  * Unlock -- in case @old == @cpu, this is a no-op.
> +  */
> + smp_store_release(&early_printk_cpu, old);
> + put_cpu();
> +
>   return n;
>  }
>  
> 



Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2016-11-29 Thread Petr Mladek
On Tue 2016-10-18 19:08:33, Peter Zijlstra wrote:
> In order to avoid multiple CPUs banging on the serial port at the same
> time, add simple serialization. This explicitly deals with nested
> contexts (like IRQs etc.).
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Makes sense. Just a small comment below.

Reviewd-by: Petr Mladek 

> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -356,14 +356,28 @@ static int __init force_early_printk_set
>  }
>  early_param("force_early_printk", force_early_printk_setup);
>  
> +static int early_printk_cpu = -1;

[...]

> + for (;;) {
> + old = cmpxchg(&early_printk_cpu, -1, cpu);
> + if (old == -1 || old == cpu)
> + break;
> +
> + cpu_relax();
> + }
>  
>   n = vscnprintf(buf, sizeof(buf), fmt, args);
>   early_console->write(early_console, buf, n);
>  
> + smp_store_release(&early_printk_cpu, old);

checkpatch.pl complains about using a barrier without a comment.
The code is simple but it still might help to add something like:

/* Releasing early_printk_cpu custom lock. */

Best Regards,
Petr


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2016-10-18 Thread Steven Rostedt
On Tue, 18 Oct 2016 19:30:46 +0200
Peter Zijlstra  wrote:

> On Tue, Oct 18, 2016 at 01:19:51PM -0400, Steven Rostedt wrote:

> Looking at this again, we really should not spin using cmpxchg(), that
> thrashes the cacheline.
> 
> The below is slightly better spinning... then again, this isn't
> performance code.

Yeah, this is similar to the hack I had.

May want to fold this in your previous version.

-- Steve

> 
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -376,10 +376,16 @@ static int early_vprintk(const char *fmt
>  
>   cpu = get_cpu();
>   for (;;) {
> - old = cmpxchg(&early_printk_cpu, -1, cpu);
> - if (old == -1 || old == cpu)
> + old = READ_ONCE(early_printk_cpu);
> + if (old == cpu)
>   break;
>  
> + if (old == -1) {
> + old = cmpxchg(&early_printk_cpu, -1, cpu);
> + if (old == -1 || old == cpu)
> + break;
> + }
> +
>   cpu_relax();
>   }
>  



Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2016-10-18 Thread Peter Zijlstra
On Tue, Oct 18, 2016 at 01:19:51PM -0400, Steven Rostedt wrote:
> 
> hehe, I have the exact same patch for my -rt work.

> > +static int early_printk_cpu = -1;
> > +
> >  static int early_vprintk(const char *fmt, va_list args)
> >  {
> > +   int n, cpu, old;
> > char buf[512];
> > +
> > +   cpu = get_cpu();
> > +   for (;;) {
> > +   old = cmpxchg(&early_printk_cpu, -1, cpu);
> > +   if (old == -1 || old == cpu)
> > +   break;

Looking at this again, we really should not spin using cmpxchg(), that
thrashes the cacheline.

The below is slightly better spinning... then again, this isn't
performance code.

--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -376,10 +376,16 @@ static int early_vprintk(const char *fmt
 
cpu = get_cpu();
for (;;) {
-   old = cmpxchg(&early_printk_cpu, -1, cpu);
-   if (old == -1 || old == cpu)
+   old = READ_ONCE(early_printk_cpu);
+   if (old == cpu)
break;
 
+   if (old == -1) {
+   old = cmpxchg(&early_printk_cpu, -1, cpu);
+   if (old == -1 || old == cpu)
+   break;
+   }
+
cpu_relax();
}
 


Re: [PATCH 3/3] early_printk: Add simple serialization to early_vprintk()

2016-10-18 Thread Steven Rostedt
On Tue, 18 Oct 2016 19:08:33 +0200
Peter Zijlstra  wrote:

> In order to avoid multiple CPUs banging on the serial port at the same
> time, add simple serialization. This explicitly deals with nested
> contexts (like IRQs etc.).
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---

hehe, I have the exact same patch for my -rt work.

-- Steve

>  kernel/printk/printk.c |   16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -356,14 +356,28 @@ static int __init force_early_printk_set
>  }
>  early_param("force_early_printk", force_early_printk_setup);
>  
> +static int early_printk_cpu = -1;
> +
>  static int early_vprintk(const char *fmt, va_list args)
>  {
> + int n, cpu, old;
>   char buf[512];
> - int n;
> +
> + cpu = get_cpu();
> + for (;;) {
> + old = cmpxchg(&early_printk_cpu, -1, cpu);
> + if (old == -1 || old == cpu)
> + break;
> +
> + cpu_relax();
> + }
>  
>   n = vscnprintf(buf, sizeof(buf), fmt, args);
>   early_console->write(early_console, buf, n);
>  
> + smp_store_release(&early_printk_cpu, old);
> + put_cpu();
> +
>   return n;
>  }
>  
>