Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-07 Thread Mathieu Desnoyers
* Peter Zijlstra (pet...@infradead.org) wrote:

[...]

Hi Peter,

Looking at this simplified version of perf's ring buffer
synchronization, I get concerned about the following issue:

> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
> 
> struct buffer {
> u64 size;
> u64 tail;
> u64 head;
> void *data;
> };
> 
> struct buffer *kbuf, *ubuf;
> 
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>   u64 tail, head, offset;
> 
>   do {
>   tail = ACCESS_ONCE(ubuf->tail);
>   offset = head = kbuf->head;
>   if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
>   /* discard @buf */
>   return;
>   }
>   head += sz;
>   } while (local_cmpxchg(&kbuf->head, offset, head) != offset)
> 

Let's suppose we have a thread executing kbuf_write(), interrupted by an
IRQ or NMI right after a successful local_cmpxchg() (space reservation
in the buffer). If the nested execution context also calls kbuf_write(),
it will therefore update ubuf->head (below) with the second reserved
space, and only after that will it return to the original thread context
and continue executing kbuf_write(), thus overwriting ubuf->head with
the prior-to-last reserved offset.
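
To make the interleaving explicit, here is a rough sketch of the
scenario I have in mind (off1/off2 and sz1/sz2 are hypothetical
reservation offsets and sizes, not taken from the code above):

  thread context                     nested IRQ/NMI context
  --------------                     ----------------------
  kbuf_write(sz1)
    local_cmpxchg() succeeds,
    kbuf->head = off1 + sz1
        ---> interrupted --->
                                     kbuf_write(sz2)
                                       local_cmpxchg() succeeds,
                                       kbuf->head = off2 + sz2
                                       memcpy(), smp_wmb()
                                       ubuf->head = off2 + sz2
        <--- returns <---
    memcpy(), smp_wmb()
    ubuf->head published again, this
    time with the prior-to-last
    offset, hiding the nested event
    from userspace until a later
    event republishes the head.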

All this probably works OK most of the time, when we have an event flow
guaranteeing that a following event will fix things up, but there
appears to be a risk of losing events near the end of the trace when
those are in nested execution contexts.

Thoughts?

Thanks,

Mathieu

> /*
>  * Ensure that if we see the userspace tail (ubuf->tail) such
>  * that there is space to write @buf without overwriting data
>  * userspace hasn't seen yet, we won't in fact store data before
>  * that read completes.
>  */
> 
> smp_mb(); /* A, matches with D */
> 
> memcpy(kbuf->data + offset, buf, sz);
> 
> /*
>  * Ensure that we write all the @buf data before we update the
>  * userspace visible ubuf->head pointer.
>  */
> smp_wmb(); /* B, matches with C */
> 
> ubuf->head = kbuf->head;
> }
> 
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
> u64 head, tail;
> 
> tail = ACCESS_ONCE(ubuf->tail);
> head = ACCESS_ONCE(ubuf->head);
> 
> /*
>  * Ensure we read the buffer boundaries before the actual buffer
>  * data...
>  */
> smp_rmb(); /* C, matches with B */
> 
> while (tail != head) {
> obj = ubuf->data + tail;
> /* process obj */
> tail += obj->size;
> tail %= ubuf->size;
> }
> 
> /*
>  * Ensure all data reads are complete before we issue the
>  * ubuf->tail update; once that update hits, kbuf_write() can
>  * observe and overwrite data.
>  */
> smp_mb(); /* D, matches with A */
> 
> ubuf->tail = tail;
> }

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-07 Thread Peter Zijlstra
On Thu, Nov 07, 2013 at 11:17:41AM +, Will Deacon wrote:
> Hi Peter,
> 
> Couple of minor fixes on the arm64 side...
> 
> On Wed, Nov 06, 2013 at 01:57:36PM +, Peter Zijlstra wrote:
> > --- a/arch/arm64/include/asm/barrier.h
> > +++ b/arch/arm64/include/asm/barrier.h
> > @@ -35,11 +35,59 @@
> >  #define smp_mb()   barrier()
> >  #define smp_rmb()  barrier()
> >  #define smp_wmb()  barrier()
> > +
> > +#define smp_store_release(p, v)
> > \
> > +do {   \
> > +   compiletime_assert_atomic_type(*p); \
> > +   smp_mb();   \
> > +   ACCESS_ONCE(*p) = (v);  \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)\
> > +({ \
> > +   typeof(*p) ___p1 = ACCESS_ONCE(*p); \
> > +   compiletime_assert_atomic_type(*p); \
> > +   smp_mb();   \
> > +   ___p1;  \
> > +})
> > +
> >  #else
> > +
> >  #define smp_mb()   asm volatile("dmb ish" : : : "memory")
> >  #define smp_rmb()  asm volatile("dmb ishld" : : : "memory")
> >  #define smp_wmb()  asm volatile("dmb ishst" : : : "memory")
> > -#endif
> 
> Why are you getting rid of this #endif?

oops..

> > +#define smp_store_release(p, v)
> > \
> > +do {   \
> > +   compiletime_assert_atomic_type(*p); \
> > +   switch (sizeof(*p)) {   \
> > +   case 4: \
> > +   asm volatile ("stlr %w1, [%0]"  \
> > +   : "=Q" (*p) : "r" (v) : "memory");  \
> > +   break;  \
> > +   case 8: \
> > +   asm volatile ("stlr %1, [%0]"   \
> > +   : "=Q" (*p) : "r" (v) : "memory");  \
> > +   break;  \
> > +   }   \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)\
> > +({ \
> > +   typeof(*p) ___p1;   \
> > +   compiletime_assert_atomic_type(*p); \
> > +   switch (sizeof(*p)) {   \
> > +   case 4: \
> > +   asm volatile ("ldar %w0, [%1]"  \
> > +   : "=r" (___p1) : "Q" (*p) : "memory");  \
> > +   break;  \
> > +   case 8: \
> > +   asm volatile ("ldar %0, [%1]"   \
> > +   : "=r" (___p1) : "Q" (*p) : "memory");  \
> > +   break;  \
> > +   }   \
> > +   ___p1;  \
> > +})
> 
> You don't need the square brackets when using the "Q" constraint (otherwise
> it will expand to something like [[x0]], which gas won't accept).
> 
> With those changes, for the general idea and arm/arm64 parts:
> 
>   Acked-by: Will Deacon <will.dea...@arm.com>

Thanks. I did that split-up I talked about yesterday; I was going to
compile them for all the archs I have a compiler for before posting again.



---
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -62,11 +62,11 @@ do {
\
compiletime_assert_atomic_type(*p); \
switch (sizeof(*p)) {   \
case 4: \
-   asm volatile ("stlr %w1, [%0]"  \
+   asm volatile ("stlr %w1, %0"\
: "=Q" (*p) : "r" (v) : "memory");  \
break;  \
case 8: \
-   asm volatile ("stlr %1, [%0]"   \
+   asm volatile ("stlr %1, %0" \

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-07 Thread Will Deacon
Hi Peter,

Couple of minor fixes on the arm64 side...

On Wed, Nov 06, 2013 at 01:57:36PM +, Peter Zijlstra wrote:
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -35,11 +35,59 @@
>  #define smp_mb()   barrier()
>  #define smp_rmb()  barrier()
>  #define smp_wmb()  barrier()
> +
> +#define smp_store_release(p, v)  
>   \
> +do {   \
> +   compiletime_assert_atomic_type(*p); \
> +   smp_mb();   \
> +   ACCESS_ONCE(*p) = (v);  \
> +} while (0)
> +
> +#define smp_load_acquire(p)\
> +({ \
> +   typeof(*p) ___p1 = ACCESS_ONCE(*p); \
> +   compiletime_assert_atomic_type(*p); \
> +   smp_mb();   \
> +   ___p1;  \
> +})
> +
>  #else
> +
>  #define smp_mb()   asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()  asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()  asm volatile("dmb ishst" : : : "memory")
> -#endif

Why are you getting rid of this #endif?

> +#define smp_store_release(p, v)  
>   \
> +do {   \
> +   compiletime_assert_atomic_type(*p); \
> +   switch (sizeof(*p)) {   \
> +   case 4: \
> +   asm volatile ("stlr %w1, [%0]"  \
> +   : "=Q" (*p) : "r" (v) : "memory");  \
> +   break;  \
> +   case 8: \
> +   asm volatile ("stlr %1, [%0]"   \
> +   : "=Q" (*p) : "r" (v) : "memory");  \
> +   break;  \
> +   }   \
> +} while (0)
> +
> +#define smp_load_acquire(p)\
> +({ \
> +   typeof(*p) ___p1;   \
> +   compiletime_assert_atomic_type(*p); \
> +   switch (sizeof(*p)) {   \
> +   case 4: \
> +   asm volatile ("ldar %w0, [%1]"  \
> +   : "=r" (___p1) : "Q" (*p) : "memory");  \
> +   break;  \
> +   case 8: \
> +   asm volatile ("ldar %0, [%1]"   \
> +   : "=r" (___p1) : "Q" (*p) : "memory");  \
> +   break;  \
> +   }   \
> +   ___p1;  \
> +})

You don't need the square brackets when using the "Q" constraint (otherwise
it will expand to something like [[x0]], which gas won't accept).

With those changes, for the general idea and arm/arm64 parts:

  Acked-by: Will Deacon <will.dea...@arm.com>

Will


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-06 Thread Peter Zijlstra
On Wed, Nov 06, 2013 at 10:48:48AM -0800, Paul E. McKenney wrote:
> A few nits on Documentation/memory-barriers.txt and some pointless
> comments elsewhere.  With the suggested Documentation/memory-barriers.txt
> fixes:
> 
> Reviewed-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

Thanks. I think I'll cut the thing into a number of smaller patches with
an identical end result, and will (hopefully) post a full new series
somewhere tomorrow.

I was thinking like:
 1 - aggressively employ asm-generic/barrier.h
 2 - Reformulate _The_ document to ACQUIRE/RELEASE
 3 - add the new store/load thingies

That should hopefully be slightly easier to look at.


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-06 Thread Paul E. McKenney
On Wed, Nov 06, 2013 at 02:57:36PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2013 at 01:51:10PM +0100, Geert Uytterhoeven wrote:
> > This is screaming for a default implementation in asm-generic.
> 
> Right you are... how about a little something like this?
> 
> There's a few archs I didn't fully merge with the generic one because of
> weird nop implementations.
> 
> asm volatile ("nop" :: ) vs asm volatile ("nop" ::: "memory") and the
> like. They probably can (and should) use the regular asm volatile
> ("nop") but I misplaced the toolchains for many of the weird archs so I
> didn't attempt.
> 
> Also fixed a silly mistake in the return type definition for most
> smp_load_acquire() implementations: typeof(p) vs typeof(*p).
> 
> ---
> Subject: arch: Introduce smp_load_acquire(), smp_store_release()
> From: Peter Zijlstra <pet...@infradead.org>
> Date: Mon, 4 Nov 2013 20:18:11 +0100
> 
> A number of situations currently require the heavyweight smp_mb(),
> even though there is no need to order prior stores against later
> loads.  Many architectures have much cheaper ways to handle these
> situations, but the Linux kernel currently has no portable way
> to make use of them.
> 
> This commit therefore supplies smp_load_acquire() and
> smp_store_release() to remedy this situation.  The new
> smp_load_acquire() primitive orders the specified load against
> any subsequent reads or writes, while the new smp_store_release()
> primitive orders the specified store against any prior reads or
> writes.  These primitives allow array-based circular FIFOs to be
> implemented without an smp_mb(), and also allow a theoretical
> hole in rcu_assign_pointer() to be closed at no additional
> expense on most architectures.
> 
> In addition, the RCU experience transitioning from explicit
> smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
> and rcu_assign_pointer(), respectively resulted in substantial
> improvements in readability.  It therefore seems likely that
> replacing other explicit barriers with smp_load_acquire() and
> smp_store_release() will provide similar benefits.  It appears
> that roughly half of the explicit barriers in core kernel code
> might be so replaced.
> 
> 
> Cc: Michael Ellerman <mich...@ellerman.id.au>
> Cc: Michael Neuling <mi...@neuling.org>
> Cc: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Victor Kaplansky <vict...@il.ibm.com>
> Cc: Oleg Nesterov <o...@redhat.com>
> Cc: Anton Blanchard <an...@samba.org>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoy...@polymtl.ca>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>

A few nits on Documentation/memory-barriers.txt and some pointless
comments elsewhere.  With the suggested Documentation/memory-barriers.txt
fixes:

Reviewed-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

> ---
>  Documentation/memory-barriers.txt |  157 
> +-
>  arch/alpha/include/asm/barrier.h  |   25 +
>  arch/arc/include/asm/Kbuild   |1 
>  arch/arc/include/asm/atomic.h |5 +
>  arch/arc/include/asm/barrier.h|   42 -
>  arch/arm/include/asm/barrier.h|   15 +++
>  arch/arm64/include/asm/barrier.h  |   50 ++
>  arch/avr32/include/asm/barrier.h  |   17 +--
>  arch/blackfin/include/asm/barrier.h   |   18 ---
>  arch/cris/include/asm/Kbuild  |1 
>  arch/cris/include/asm/barrier.h   |   25 -
>  arch/frv/include/asm/barrier.h|8 -
>  arch/h8300/include/asm/barrier.h  |   21 
>  arch/hexagon/include/asm/Kbuild   |1 
>  arch/hexagon/include/asm/barrier.h|   41 
>  arch/ia64/include/asm/barrier.h   |   49 ++
>  arch/m32r/include/asm/barrier.h   |   80 -
>  arch/m68k/include/asm/barrier.h   |   14 ---
>  arch/metag/include/asm/barrier.h  |   15 +++
>  arch/microblaze/include/asm/Kbuild|1 
>  arch/microblaze/include/asm/barrier.h |   27 -
>  arch/mips/include/asm/barrier.h   |   15 +++
>  arch/mn10300/include/asm/Kbuild   |1 
>  arch/mn10300/include/asm/barrier.h|   37 
>  arch/parisc/include/asm/Kbuild|1 
>  arch/parisc/include/asm/barrier.h |   35 ---
>  arch/powerpc/include/asm/barrier.h|   21 
>  arch/s390/include/asm/barrier.h   |   15 +++
>  arch/score/include/asm/Kbuild |1 
>  arch/score/include/asm/barrier.h  |   16 ---
>  arch/sh/include/asm/barrier.h |   21 
>  arch/sparc/include/asm/barrier_32.h   |   11 --
>  arch/sparc/include/asm/barrier_64.h   |   15 +++
>  arch/tile/include/asm/barrier.h   |   68 --
>  arch/unicore32/include/asm/barrier.h  |   11 --
>  arch/x86/include/asm/barrier.h|   15 +++
>  arch/xtensa/include/asm/barrier.h |9 -
>  include/asm-generic/barrier.h |   55 +--
>  include/linux/compiler.h  |9 +
>  39 files changed, 375 insertions(+), 594 deletions(-)
> 
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -371,33 +371,35 @@ VARIETIES OF MEMORY BARRIER
> 
>  And a couple of implicit 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-06 Thread Peter Zijlstra
On Wed, Nov 06, 2013 at 01:51:10PM +0100, Geert Uytterhoeven wrote:
> This is screaming for a default implementation in asm-generic.

Right you are... how about a little something like this?

There's a few archs I didn't fully merge with the generic one because of
weird nop implementations.

asm volatile ("nop" :: ) vs asm volatile ("nop" ::: "memory") and the
like. They probably can (and should) use the regular asm volatile
("nop") but I misplaced the toolchains for many of the weird archs so I
didn't attempt.

Also fixed a silly mistake in the return type definition for most
smp_load_acquire() implementations: typeof(p) vs typeof(*p).

---
Subject: arch: Introduce smp_load_acquire(), smp_store_release()
From: Peter Zijlstra <pet...@infradead.org>
Date: Mon, 4 Nov 2013 20:18:11 +0100

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specified store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.


Cc: Michael Ellerman <mich...@ellerman.id.au>
Cc: Michael Neuling <mi...@neuling.org>
Cc: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Victor Kaplansky <vict...@il.ibm.com>
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Anton Blanchard <an...@samba.org>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoy...@polymtl.ca>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
---
 Documentation/memory-barriers.txt |  157 +-
 arch/alpha/include/asm/barrier.h  |   25 +
 arch/arc/include/asm/Kbuild   |1 
 arch/arc/include/asm/atomic.h |5 +
 arch/arc/include/asm/barrier.h|   42 -
 arch/arm/include/asm/barrier.h|   15 +++
 arch/arm64/include/asm/barrier.h  |   50 ++
 arch/avr32/include/asm/barrier.h  |   17 +--
 arch/blackfin/include/asm/barrier.h   |   18 ---
 arch/cris/include/asm/Kbuild  |1 
 arch/cris/include/asm/barrier.h   |   25 -
 arch/frv/include/asm/barrier.h|8 -
 arch/h8300/include/asm/barrier.h  |   21 
 arch/hexagon/include/asm/Kbuild   |1 
 arch/hexagon/include/asm/barrier.h|   41 
 arch/ia64/include/asm/barrier.h   |   49 ++
 arch/m32r/include/asm/barrier.h   |   80 -
 arch/m68k/include/asm/barrier.h   |   14 ---
 arch/metag/include/asm/barrier.h  |   15 +++
 arch/microblaze/include/asm/Kbuild|1 
 arch/microblaze/include/asm/barrier.h |   27 -
 arch/mips/include/asm/barrier.h   |   15 +++
 arch/mn10300/include/asm/Kbuild   |1 
 arch/mn10300/include/asm/barrier.h|   37 
 arch/parisc/include/asm/Kbuild|1 
 arch/parisc/include/asm/barrier.h |   35 ---
 arch/powerpc/include/asm/barrier.h|   21 
 arch/s390/include/asm/barrier.h   |   15 +++
 arch/score/include/asm/Kbuild |1 
 arch/score/include/asm/barrier.h  |   16 ---
 arch/sh/include/asm/barrier.h |   21 
 arch/sparc/include/asm/barrier_32.h   |   11 --
 arch/sparc/include/asm/barrier_64.h   |   15 +++
 arch/tile/include/asm/barrier.h   |   68 --
 arch/unicore32/include/asm/barrier.h  |   11 --
 arch/x86/include/asm/barrier.h|   15 +++
 arch/xtensa/include/asm/barrier.h |9 -
 include/asm-generic/barrier.h |   55 +--
 include/linux/compiler.h  |9 +
 39 files changed, 375 insertions(+), 594 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -371,33 +371,35 @@ VARIETIES OF MEMORY BARRIER
 
 And a couple of implicit varieties:
 
- (5) LOCK operations.
+ (5) ACQUIRE operations.
 
  This acts as a one-way permeable barrier.  It guarantees that all memory
- operations after the LOCK operation will appear to happen after the LOCK
- operation with respect to the other components of the system.
+ operations after the ACQUIRE operation will appear to happen after the
+ ACQUIRE operation with respect to the other components of the system.
 
- Memory 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-06 Thread Geert Uytterhoeven
On Wed, Nov 6, 2013 at 1:39 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>  Documentation/memory-barriers.txt |  157 
> +-
>  arch/alpha/include/asm/barrier.h  |   15 +++
>  arch/arc/include/asm/barrier.h|   15 +++
>  arch/arm/include/asm/barrier.h|   15 +++
>  arch/arm64/include/asm/barrier.h  |   50 ++
>  arch/avr32/include/asm/barrier.h  |   14 +++
>  arch/blackfin/include/asm/barrier.h   |   15 +++
>  arch/cris/include/asm/barrier.h   |   15 +++
>  arch/frv/include/asm/barrier.h|   15 +++
>  arch/h8300/include/asm/barrier.h  |   15 +++
>  arch/hexagon/include/asm/barrier.h|   15 +++
>  arch/ia64/include/asm/barrier.h   |   49 ++
>  arch/m32r/include/asm/barrier.h   |   15 +++
>  arch/m68k/include/asm/barrier.h   |   15 +++
>  arch/metag/include/asm/barrier.h  |   15 +++
>  arch/microblaze/include/asm/barrier.h |   15 +++
>  arch/mips/include/asm/barrier.h   |   15 +++
>  arch/mn10300/include/asm/barrier.h|   15 +++
>  arch/parisc/include/asm/barrier.h |   15 +++
>  arch/powerpc/include/asm/barrier.h|   21 
>  arch/s390/include/asm/barrier.h   |   15 +++
>  arch/score/include/asm/barrier.h  |   15 +++
>  arch/sh/include/asm/barrier.h |   15 +++
>  arch/sparc/include/asm/barrier_32.h   |   15 +++
>  arch/sparc/include/asm/barrier_64.h   |   15 +++
>  arch/tile/include/asm/barrier.h   |   15 +++
>  arch/unicore32/include/asm/barrier.h  |   15 +++
>  arch/x86/include/asm/barrier.h|   15 +++
>  arch/xtensa/include/asm/barrier.h |   15 +++
>  include/linux/compiler.h  |9 +
>  30 files changed, 581 insertions(+), 79 deletions(-)

This is screaming for a default implementation in asm-generic.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-06 Thread Peter Zijlstra

Subject: arch: Introduce smp_load_acquire(), smp_store_release()
From: Peter Zijlstra <pet...@infradead.org>
Date: Mon, 4 Nov 2013 20:18:11 +0100

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specified store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.
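
For illustration only (not part of the patch): a minimal
single-producer/single-consumer ring sketch using the pointer-taking
form of the primitives that the series settled on,
smp_store_release(&x, v) / smp_load_acquire(&x).  All names and sizes
below are made up.

struct sketch_ring {
	unsigned long	head;		/* written only by the producer */
	unsigned long	tail;		/* written only by the consumer */
	void		*slot[256];	/* ring size must be a power of two */
};

/* producer: returns 0 on success, -1 if the ring is full */
static int sketch_put(struct sketch_ring *r, void *obj)
{
	unsigned long head = r->head;
	unsigned long tail = smp_load_acquire(&r->tail);

	if (head - tail >= 256)
		return -1;

	r->slot[head & 255] = obj;
	/* order the slot write before publishing the new head */
	smp_store_release(&r->head, head + 1);
	return 0;
}

/* consumer: returns NULL if the ring is empty */
static void *sketch_get(struct sketch_ring *r)
{
	unsigned long tail = r->tail;
	unsigned long head = smp_load_acquire(&r->head);
	void *obj;

	if (tail == head)
		return NULL;

	obj = r->slot[tail & 255];
	/* order the slot read before making the slot reusable */
	smp_store_release(&r->tail, tail + 1);
	return obj;
}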

Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoy...@polymtl.ca>
Cc: Michael Ellerman <mich...@ellerman.id.au>
Cc: Michael Neuling <mi...@neuling.org>
Cc: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Victor Kaplansky <vict...@il.ibm.com>
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Anton Blanchard <an...@samba.org>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
---
 Documentation/memory-barriers.txt |  157 +-
 arch/alpha/include/asm/barrier.h  |   15 +++
 arch/arc/include/asm/barrier.h|   15 +++
 arch/arm/include/asm/barrier.h|   15 +++
 arch/arm64/include/asm/barrier.h  |   50 ++
 arch/avr32/include/asm/barrier.h  |   14 +++
 arch/blackfin/include/asm/barrier.h   |   15 +++
 arch/cris/include/asm/barrier.h   |   15 +++
 arch/frv/include/asm/barrier.h|   15 +++
 arch/h8300/include/asm/barrier.h  |   15 +++
 arch/hexagon/include/asm/barrier.h|   15 +++
 arch/ia64/include/asm/barrier.h   |   49 ++
 arch/m32r/include/asm/barrier.h   |   15 +++
 arch/m68k/include/asm/barrier.h   |   15 +++
 arch/metag/include/asm/barrier.h  |   15 +++
 arch/microblaze/include/asm/barrier.h |   15 +++
 arch/mips/include/asm/barrier.h   |   15 +++
 arch/mn10300/include/asm/barrier.h|   15 +++
 arch/parisc/include/asm/barrier.h |   15 +++
 arch/powerpc/include/asm/barrier.h|   21 
 arch/s390/include/asm/barrier.h   |   15 +++
 arch/score/include/asm/barrier.h  |   15 +++
 arch/sh/include/asm/barrier.h |   15 +++
 arch/sparc/include/asm/barrier_32.h   |   15 +++
 arch/sparc/include/asm/barrier_64.h   |   15 +++
 arch/tile/include/asm/barrier.h   |   15 +++
 arch/unicore32/include/asm/barrier.h  |   15 +++
 arch/x86/include/asm/barrier.h|   15 +++
 arch/xtensa/include/asm/barrier.h |   15 +++
 include/linux/compiler.h  |9 +
 30 files changed, 581 insertions(+), 79 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -371,33 +371,35 @@ VARIETIES OF MEMORY BARRIER
 
 And a couple of implicit varieties:
 
- (5) LOCK operations.
+ (5) ACQUIRE operations.
 
  This acts as a one-way permeable barrier.  It guarantees that all memory
- operations after the LOCK operation will appear to happen after the LOCK
- operation with respect to the other components of the system.
+ operations after the ACQUIRE operation will appear to happen after the
+ ACQUIRE operation with respect to the other components of the system.
 
- Memory operations that occur before a LOCK operation may appear to happen
- after it completes.
+ Memory operations that occur before a ACQUIRE operation may appear to
+ happen after it completes.
 
- A LOCK operation should almost always be paired with an UNLOCK operation.
+ A ACQUIRE operation should almost always be paired with an RELEASE
+ operation.
 
 
- (6) UNLOCK operations.
+ (6) RELEASE operations.
 
  This also acts as a one-way permeable barrier.  It guarantees that all
- memory operations before the UNLOCK operation will appear to happen before
- the UNLOCK operation with respect to the other components of the system.
+ memory operations before the RELEASE operation will appear to happen
+ before the RELEASE operation with respect to the other components of the
+ system.
 
- Memory operations that occur after an UNLOCK operation may appear to
+ Memory operations that occur after an RELEASE operation may appear to
  happen before it completes.
 
- LOCK and UNLOCK operations are guaranteed to appear with respect to each
- other strictly in the order 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-06 Thread Will Deacon
On Tue, Nov 05, 2013 at 06:49:43PM +, Peter Zijlstra wrote:
> On Tue, Nov 05, 2013 at 02:05:48PM +, Will Deacon wrote:
> > > > +
> > > > +#define smp_store_release(p, v)
> > > >   \
> > > > +do { \
> > > > + smp_mb();   \
> > > > + ACCESS_ONCE(p) = (v);   \
> > > > +} while (0)
> > > > +
> > > > +#define smp_load_acquire(p, v) 
> > > >   \
> > > > +do { \
> > > > + typeof(p) ___p1 = ACCESS_ONCE(p);   \
> > > > + smp_mb();   \
> > > > + return ___p1;   \
> > > > +} while (0)
> > 
> > What data sizes do these accessors operate on? Assuming that we want
> > single-copy atomicity (with respect to interrupts in the UP case), we
> > probably want a check to stop people passing in things like structs.
> 
> Fair enough; I think we should restrict to native word sizes same as we
> do for atomics.
> 
> Something like so perhaps:
> 
> #ifdef CONFIG_64BIT
> #define __check_native_word(t)  (sizeof(t) == 4 || sizeof(t) == 8)

Ok, if we want to support 32-bit accesses on 64-bit machines, that will
complicate some of your assembly (more below).

> #else
> #define __check_native_word(t)  (sizeof(t) == 4)
> #endif
> 
> #define smp_store_release(p, v)   \
> do {  \
>   BUILD_BUG_ON(!__check_native_word(p));  \
>   smp_mb();   \
>   ACCESS_ONCE(p) = (v);   \
> } while (0)
> 
> > > > +#define smp_store_release(p, v)
> > > >   \
> > > > +do { \
> > > > + asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> > 
> > Missing comma between the operands. Also, that 'w' output modifier enforces
> > a 32-bit store (same early question about sizes). Finally, it might be more
> > efficient to use "=Q" for the addressing mode, rather than take the address
> > of p manually.
> 
> so something like:
> 
>   asm volatile ("stlr %0, [%1]" : : "r" (v), "=Q" (p) : "memory");
> 
> ?
> 
> My inline asm foo is horrid and I mostly get by with copy paste from a
> semi similar existing form :/

Almost: you just need to drop the square brackets and make the memory
location an output operand:

asm volatile("stlr %1, %0" : "=Q" (p) : "r" (v) : "memory");

however, for a 32-bit access, you need to use an output modifier:

asm volatile("stlr %w1, %0" : "=Q" (p) : "r" (v) : "memory");

so I guess a switch on sizeof(p) is required.
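
i.e. something along these lines, just a sketch of the shape (the
arm64 patch elsewhere in this thread does essentially this, modulo
taking a pointer and operating on *p):

#define smp_store_release(p, v)						\
do {									\
	switch (sizeof(p)) {						\
	case 4:								\
		asm volatile ("stlr %w1, %0"				\
				: "=Q" (p) : "r" (v) : "memory");	\
		break;							\
	case 8:								\
		asm volatile ("stlr %1, %0"				\
				: "=Q" (p) : "r" (v) : "memory");	\
		break;							\
	}								\
} while (0)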

> > Random other question: have you considered how these accessors should behave
> > when presented with __iomem pointers?
> 
> A what? ;-)

Then let's go with Paul's suggestion of mandating __kernel, or the like
(unless we need to worry about __user addresses for things like futexes?).

Will


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-05 Thread Peter Zijlstra
On Tue, Nov 05, 2013 at 02:05:48PM +, Will Deacon wrote:
> > > +
> > > +#define smp_store_release(p, v)  
> > > \
> > > +do { \
> > > + smp_mb();   \
> > > + ACCESS_ONCE(p) = (v);   \
> > > +} while (0)
> > > +
> > > +#define smp_load_acquire(p, v)   
> > > \
> > > +do { \
> > > + typeof(p) ___p1 = ACCESS_ONCE(p);   \
> > > + smp_mb();   \
> > > + return ___p1;   \
> > > +} while (0)
> 
> What data sizes do these accessors operate on? Assuming that we want
> single-copy atomicity (with respect to interrupts in the UP case), we
> probably want a check to stop people passing in things like structs.

Fair enough; I think we should restrict to native word sizes same as we
do for atomics.

Something like so perhaps:

#ifdef CONFIG_64BIT
#define __check_native_word(t)  (sizeof(t) == 4 || sizeof(t) == 8)
#else
#define __check_native_word(t)  (sizeof(t) == 4)
#endif

#define smp_store_release(p, v) \
do {\
BUILD_BUG_ON(!__check_native_word(p));  \
smp_mb();   \
ACCESS_ONCE(p) = (v);   \
} while (0)
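
And presumably the load side wants to be a statement expression rather than
a do { } while (0), so that it can actually hand back the value; a sketch
along the same lines (not part of the posted patch, same native-word
assumption):

#define smp_load_acquire(p)						\
({									\
	typeof(p) ___p1;						\
	BUILD_BUG_ON(!__check_native_word(p));				\
	___p1 = ACCESS_ONCE(p);						\
	smp_mb();							\
	___p1;								\
})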

> > > +#define smp_store_release(p, v)  
> > > \
> > > +do { \
> > > + asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> 
> Missing comma between the operands. Also, that 'w' output modifier enforces
> a 32-bit store (same early question about sizes). Finally, it might be more
> efficient to use "=Q" for the addressing mode, rather than take the address
> of p manually.

so something like:

asm volatile ("stlr %0, [%1]" : : "r" (v), "=Q" (p) : "memory");

?

My inline asm foo is horrid and I mostly get by with copy paste from a
semi similar existing form :/
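
Guessing further at the "=Q" form Will describes, for the 64-bit case only
(completely untested; keeping the convention that p is the lvalue, so the
memory operand is p itself rather than a manually taken address):

	/* store-release: memory operand as an "=Q" output, no explicit address reg */
	asm volatile ("stlr %1, %0" : "=Q" (p) : "r" (v) : "memory");

	/* load-acquire: memory operand as a "Q" input */
	asm volatile ("ldar %0, %1" : "=r" (___p1) : "Q" (p) : "memory");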

> Random other question: have you considered how these accessors should behave
> when presented with __iomem pointers?

A what? ;-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-05 Thread Paul E. McKenney
On Tue, Nov 05, 2013 at 02:05:48PM +, Will Deacon wrote:
> On Mon, Nov 04, 2013 at 08:53:44PM +, Paul E. McKenney wrote:
> > On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> > Some comments below.  I believe that opcodes need to be fixed for IA64.
> > I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
> > be able to tell us.

[ . . . ]

> > > +} while (0)
> > > +
> > > +#define smp_load_acquire(p)  \
> > > +do { \
> > > + typeof(p) ___p1;\
> > > + asm volatile ("ldar %w0, [%1]"  \
> > > + : "=r" (___p1) : "r" (&p) : "memory");  \
> > > + return ___p1;   \
> 
> Similar comments here wrt Q constraint.
> 
> Random other question: have you considered how these accessors should behave
> when presented with __iomem pointers?

Should we have something to make sparse yell if not __kernel or some such?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-05 Thread Will Deacon
On Mon, Nov 04, 2013 at 08:53:44PM +, Paul E. McKenney wrote:
> On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> Some comments below.  I believe that opcodes need to be fixed for IA64.
> I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
> be able to tell us.

[...]

> > diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> > index 60f15e274e6d..a804093d6891 100644
> > --- a/arch/arm/include/asm/barrier.h
> > +++ b/arch/arm/include/asm/barrier.h
> > @@ -53,10 +53,36 @@
> >  #define smp_mb() barrier()
> >  #define smp_rmb()barrier()
> >  #define smp_wmb()barrier()
> > +
> > +#define smp_store_release(p, v)
> >   \
> > +do { \
> > + smp_mb();   \
> > + ACCESS_ONCE(p) = (v);   \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p, v) 
> >   \
> > +do { \
> > + typeof(p) ___p1 = ACCESS_ONCE(p);   \
> > + smp_mb();   \
> > + return ___p1;   \
> > +} while (0)

What data sizes do these accessors operate on? Assuming that we want
single-copy atomicity (with respect to interrupts in the UP case), we
probably want a check to stop people passing in things like structs.

> >  #else
> >  #define smp_mb() dmb(ish)
> >  #define smp_rmb()smp_mb()
> >  #define smp_wmb()dmb(ishst)
> > +
> 
> Seems like there should be some sort of #ifdef condition to distinguish
> between these.  My guess is something like:
> 
> #if __LINUX_ARM_ARCH__ > 7
> 
> But I must defer to the ARM guys.  For all I know, they might prefer
> arch/arm to stick with smp_mb() and have arch/arm64 do the ldar and stlr.

Yes. For arch/arm/, I'd rather we stick with the smp_mb() for the time
being. We don't (yet) have any 32-bit ARMv8 support, and the efforts towards
a single zImage could do without minor variations like this, not to mention
the usual backlash I get whenever introducing something that needs a
relatively recent binutils.

> > +#define smp_store_release(p, v)
> >   \
> > +do { \
> > > + asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)  \
> > +do { \
> > + typeof(p) ___p1;\
> > + asm volatile ("ldar %w0, [%1]"  \
> > + : "=r" (___p1) : "r" () : "memory");  \
> > + return ___p1;   \
> > +} while (0)
> >  #endif
> >
> >  #define read_barrier_depends()   do { } while(0)
> > diff --git a/arch/arm64/include/asm/barrier.h 
> > b/arch/arm64/include/asm/barrier.h
> > index d4a63338a53c..0da2d4ebb9a8 100644
> > --- a/arch/arm64/include/asm/barrier.h
> > +++ b/arch/arm64/include/asm/barrier.h
> > @@ -35,10 +35,38 @@
> >  #define smp_mb() barrier()
> >  #define smp_rmb()barrier()
> >  #define smp_wmb()barrier()
> > +
> > +#define smp_store_release(p, v)
> >   \
> > +do { \
> > + smp_mb();   \
> > + ACCESS_ONCE(p) = (v);   \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p, v) 
> >   \
> > +do { \
> > + typeof(p) ___p1 = ACCESS_ONCE(p);   \
> > + smp_mb();   \
> > + return ___p1;   \
> > +} while (0)
> > +
> >  #else
> > +
> >  #define smp_mb() asm volatile("dmb ish" : : : "memory")
> >  #define smp_rmb()asm volatile("dmb ishld" : : : "memory")
> >  #define smp_wmb()asm volatile("dmb ishst" : : : "memory")
> > +
> > +#define smp_store_release(p, v)
> >   \
> > +do { \
> > > + asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\

Missing comma between the operands. Also, that 'w' output modifier enforces
a 32-bit store (same early question about sizes). Finally, it might be more
efficient to use "=Q" for the addressing mode, rather than take the address
of p 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 08:18:11PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> > +#define smp_load_acquire(p, v) 
> > \
> 
> I R idiot!! :-)

OK, I did miss this one as well...  :-/

Thanx, Paul

> ---
>  arch/alpha/include/asm/barrier.h  | 13 +++
>  arch/arc/include/asm/barrier.h| 13 +++
>  arch/arm/include/asm/barrier.h| 26 +
>  arch/arm64/include/asm/barrier.h  | 28 +++
>  arch/avr32/include/asm/barrier.h  | 12 ++
>  arch/blackfin/include/asm/barrier.h   | 13 +++
>  arch/cris/include/asm/barrier.h   | 13 +++
>  arch/frv/include/asm/barrier.h| 13 +++
>  arch/h8300/include/asm/barrier.h  | 13 +++
>  arch/hexagon/include/asm/barrier.h| 13 +++
>  arch/ia64/include/asm/barrier.h   | 43 
> +++
>  arch/m32r/include/asm/barrier.h   | 13 +++
>  arch/m68k/include/asm/barrier.h   | 13 +++
>  arch/metag/include/asm/barrier.h  | 13 +++
>  arch/microblaze/include/asm/barrier.h | 13 +++
>  arch/mips/include/asm/barrier.h   | 13 +++
>  arch/mn10300/include/asm/barrier.h| 13 +++
>  arch/parisc/include/asm/barrier.h | 13 +++
>  arch/powerpc/include/asm/barrier.h| 15 
>  arch/s390/include/asm/barrier.h   | 13 +++
>  arch/score/include/asm/barrier.h  | 13 +++
>  arch/sh/include/asm/barrier.h | 13 +++
>  arch/sparc/include/asm/barrier_32.h   | 13 +++
>  arch/sparc/include/asm/barrier_64.h   | 13 +++
>  arch/tile/include/asm/barrier.h   | 13 +++
>  arch/unicore32/include/asm/barrier.h  | 13 +++
>  arch/x86/include/asm/barrier.h| 13 +++
>  arch/xtensa/include/asm/barrier.h | 13 +++
>  28 files changed, 423 insertions(+)
> 
> diff --git a/arch/alpha/include/asm/barrier.h 
> b/arch/alpha/include/asm/barrier.h
> index ce8860a0b32d..464139feee97 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
>  #define smp_read_barrier_depends()   do { } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)  
> \
> +do { \
> + smp_mb();   \
> + ACCESS_ONCE(p) = (v);   \
> +} while (0)
> +
> +#define smp_load_acquire(p)  \
> +do { \
> + typeof(p) ___p1 = ACCESS_ONCE(p);   \
> + smp_mb();   \
> + return ___p1;   \
> +} while (0)
> +
>  #define set_mb(var, value) \
>  do { var = value; mb(); } while (0)
> 
> diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> index f6cb7c4ffb35..a779da846fb5 100644
> --- a/arch/arc/include/asm/barrier.h
> +++ b/arch/arc/include/asm/barrier.h
> @@ -30,6 +30,19 @@
>  #define smp_wmb()   barrier()
>  #endif
> 
> +#define smp_store_release(p, v)  
> \
> +do { \
> + smp_mb();   \
> + ACCESS_ONCE(p) = (v);   \
> +} while (0)
> +
> +#define smp_load_acquire(p)  \
> +do { \
> + typeof(p) ___p1 = ACCESS_ONCE(p);   \
> + smp_mb();   \
> + return ___p1;   \
> +} while (0)
> +
>  #define smp_mb__before_atomic_dec()  barrier()
>  #define smp_mb__after_atomic_dec()   barrier()
>  #define smp_mb__before_atomic_inc()  barrier()
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 60f15e274e6d..4ada4720bdeb 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -53,10 +53,36 @@
>  #define smp_mb() barrier()
>  #define smp_rmb()barrier()
>  #define smp_wmb()barrier()
> +
> +#define smp_store_release(p, v)  
> \
> +do { \
> + smp_mb();   \
> + ACCESS_ONCE(p) = (v);   \
> +} while (0)
> +
> +#define 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> > All this is leading me to suggest the following shortenings of names:
> > 
> > smp_load_with_acquire_semantics() -> smp_load_acquire()
> > 
> > smp_store_with_release_semantics() -> smp_store_release()
> > 
> > But names aside, the above gets rid of explicit barriers on TSO 
> > architectures,
> > allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
> > the heavier-weight sync.
> 
> A little something like this? Completely guessed at the arm/arm64/ia64
> asm, but at least for those archs I found proper instructions (I hope),
> for x86,sparc,s390 which are TSO we can do with a barrier and PPC like
> said can do with the lwsync, all others fall back to using a smp_mb().
> 
> Should probably come with a proper changelog and an addition to _The_
> document.

Maybe something like this for the changelog?

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specifed store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.

Some comments below.  I believe that opcodes need to be fixed for IA64.
I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
be able to tell us.

Other than that, for the rest:

Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com

> ---
>  arch/alpha/include/asm/barrier.h  | 13 +++
>  arch/arc/include/asm/barrier.h| 13 +++
>  arch/arm/include/asm/barrier.h| 26 +
>  arch/arm64/include/asm/barrier.h  | 28 +++
>  arch/avr32/include/asm/barrier.h  | 12 ++
>  arch/blackfin/include/asm/barrier.h   | 13 +++
>  arch/cris/include/asm/barrier.h   | 13 +++
>  arch/frv/include/asm/barrier.h| 13 +++
>  arch/h8300/include/asm/barrier.h  | 13 +++
>  arch/hexagon/include/asm/barrier.h| 13 +++
>  arch/ia64/include/asm/barrier.h   | 43 
> +++
>  arch/m32r/include/asm/barrier.h   | 13 +++
>  arch/m68k/include/asm/barrier.h   | 13 +++
>  arch/metag/include/asm/barrier.h  | 13 +++
>  arch/microblaze/include/asm/barrier.h | 13 +++
>  arch/mips/include/asm/barrier.h   | 13 +++
>  arch/mn10300/include/asm/barrier.h| 13 +++
>  arch/parisc/include/asm/barrier.h | 13 +++
>  arch/powerpc/include/asm/barrier.h| 15 
>  arch/s390/include/asm/barrier.h   | 13 +++
>  arch/score/include/asm/barrier.h  | 13 +++
>  arch/sh/include/asm/barrier.h | 13 +++
>  arch/sparc/include/asm/barrier_32.h   | 13 +++
>  arch/sparc/include/asm/barrier_64.h   | 13 +++
>  arch/tile/include/asm/barrier.h   | 13 +++
>  arch/unicore32/include/asm/barrier.h  | 13 +++
>  arch/x86/include/asm/barrier.h| 13 +++
>  arch/xtensa/include/asm/barrier.h | 13 +++
>  28 files changed, 423 insertions(+)
> 
> diff --git a/arch/alpha/include/asm/barrier.h 
> b/arch/alpha/include/asm/barrier.h
> index ce8860a0b32d..464139feee97 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
>  #define smp_read_barrier_depends()   do { } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)  
> \
> +do { \
> + smp_mb();   \
> + ACCESS_ONCE(p) = (v); 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> +#define smp_load_acquire(p, v)   
> \

I R idiot!! :-)

---
 arch/alpha/include/asm/barrier.h  | 13 +++
 arch/arc/include/asm/barrier.h| 13 +++
 arch/arm/include/asm/barrier.h| 26 +
 arch/arm64/include/asm/barrier.h  | 28 +++
 arch/avr32/include/asm/barrier.h  | 12 ++
 arch/blackfin/include/asm/barrier.h   | 13 +++
 arch/cris/include/asm/barrier.h   | 13 +++
 arch/frv/include/asm/barrier.h| 13 +++
 arch/h8300/include/asm/barrier.h  | 13 +++
 arch/hexagon/include/asm/barrier.h| 13 +++
 arch/ia64/include/asm/barrier.h   | 43 +++
 arch/m32r/include/asm/barrier.h   | 13 +++
 arch/m68k/include/asm/barrier.h   | 13 +++
 arch/metag/include/asm/barrier.h  | 13 +++
 arch/microblaze/include/asm/barrier.h | 13 +++
 arch/mips/include/asm/barrier.h   | 13 +++
 arch/mn10300/include/asm/barrier.h| 13 +++
 arch/parisc/include/asm/barrier.h | 13 +++
 arch/powerpc/include/asm/barrier.h| 15 
 arch/s390/include/asm/barrier.h   | 13 +++
 arch/score/include/asm/barrier.h  | 13 +++
 arch/sh/include/asm/barrier.h | 13 +++
 arch/sparc/include/asm/barrier_32.h   | 13 +++
 arch/sparc/include/asm/barrier_64.h   | 13 +++
 arch/tile/include/asm/barrier.h   | 13 +++
 arch/unicore32/include/asm/barrier.h  | 13 +++
 arch/x86/include/asm/barrier.h| 13 +++
 arch/xtensa/include/asm/barrier.h | 13 +++
 28 files changed, 423 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..464139feee97 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends() do { } while (0)
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..a779da846fb5 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,19 @@
 #define smp_wmb()   barrier()
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define smp_mb__before_atomic_dec()barrier()
 #define smp_mb__after_atomic_dec() barrier()
 #define smp_mb__before_atomic_inc()barrier()
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..4ada4720bdeb 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,10 +53,36 @@
 #define smp_mb()   barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  barrier()
+
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> All this is leading me to suggest the following shortenings of names:
> 
>   smp_load_with_acquire_semantics() -> smp_load_acquire()
> 
>   smp_store_with_release_semantics() -> smp_store_release()
> 
> But names aside, the above gets rid of explicit barriers on TSO architectures,
> allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
> the heavier-weight sync.

A little something like this? Completely guessed at the arm/arm64/ia64
asm, but at least for those archs I found proper instructions (I hope),
for x86,sparc,s390 which are TSO we can do with a barrier and PPC like
said can do with the lwsync, all others fall back to using a smp_mb().

Should probably come with a proper changelog and an addition to _The_
document.

---
 arch/alpha/include/asm/barrier.h  | 13 +++
 arch/arc/include/asm/barrier.h| 13 +++
 arch/arm/include/asm/barrier.h| 26 +
 arch/arm64/include/asm/barrier.h  | 28 +++
 arch/avr32/include/asm/barrier.h  | 12 ++
 arch/blackfin/include/asm/barrier.h   | 13 +++
 arch/cris/include/asm/barrier.h   | 13 +++
 arch/frv/include/asm/barrier.h| 13 +++
 arch/h8300/include/asm/barrier.h  | 13 +++
 arch/hexagon/include/asm/barrier.h| 13 +++
 arch/ia64/include/asm/barrier.h   | 43 +++
 arch/m32r/include/asm/barrier.h   | 13 +++
 arch/m68k/include/asm/barrier.h   | 13 +++
 arch/metag/include/asm/barrier.h  | 13 +++
 arch/microblaze/include/asm/barrier.h | 13 +++
 arch/mips/include/asm/barrier.h   | 13 +++
 arch/mn10300/include/asm/barrier.h| 13 +++
 arch/parisc/include/asm/barrier.h | 13 +++
 arch/powerpc/include/asm/barrier.h| 15 
 arch/s390/include/asm/barrier.h   | 13 +++
 arch/score/include/asm/barrier.h  | 13 +++
 arch/sh/include/asm/barrier.h | 13 +++
 arch/sparc/include/asm/barrier_32.h   | 13 +++
 arch/sparc/include/asm/barrier_64.h   | 13 +++
 arch/tile/include/asm/barrier.h   | 13 +++
 arch/unicore32/include/asm/barrier.h  | 13 +++
 arch/x86/include/asm/barrier.h| 13 +++
 arch/xtensa/include/asm/barrier.h | 13 +++
 28 files changed, 423 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..464139feee97 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends() do { } while (0)
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..a779da846fb5 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,19 @@
 #define smp_wmb()   barrier()
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define smp_mb__before_atomic_dec()barrier()
 #define smp_mb__after_atomic_dec() barrier()
 #define smp_mb__before_atomic_inc()barrier()
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..a804093d6891 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,10 +53,36 @@
 #define smp_mb()   barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 11:05:53AM +, Will Deacon wrote:
> On Sun, Nov 03, 2013 at 11:34:00PM +, Linus Torvalds wrote:
> > So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
> > problem is that a "smp_rmb()" doesn't really "attach" to the preceding
> > write.
> 
> Agreed.
> 
> > This is analogous to a "acquire" operation: you cannot make an
> > "acquire" barrier, because it's not a barrier *between* two ops, it's
> > associated with one particular op.
> > 
> > So what I *think* you actually really really want is a "store with
> > release consistency, followed by a write barrier".
> 
> How does that order reads against reads? (Paul mentioned this as a
> requirement). I'm not clear about the use case for this, so perhaps there is a
> dependency that I'm not aware of.

An smp_store_with_release_semantics() orders against prior reads -and-
writes.  It maps to barrier() for x86, stlr for ARM, and lwsync for
PowerPC, as called out in my prototype definitions.

> > In TSO, afaik all stores have release consistency, and all writes are
> > ordered, which is why this is a no-op in TSO. And x86 also has that
> > "all stores have release consistency, and all writes are ordered"
> > model, even if TSO doesn't really describe the x86 model.
> > 
> > But on ARM64, for example, I think you'd really want the store itself
> > to be done with "stlr" (store with release), and then follow up with a
> > "dsb st" after that.
> 
> So a dsb is pretty heavyweight here (it prevents execution of *any* further
> instructions until all preceeding stores have completed, as well as
> ensuring completion of any ongoing cache flushes). In conjunction with the
> store-release, that's going to hold everything up until the store-release
> (and therefore any preceeding memory accesses) have completed. Granted, I
> think that gives Paul his read/read ordering, but it's a lot heavier than
> what's required.

I do not believe that we need the trailing "dsb st".

> > And notice how that requires you to mark the store itself. There is no
> > actual barrier *after* the store that does the optimized model.
> > 
> > Of course, it's entirely possible that it's not worth worrying about
> > this on ARM64, and that just doing it as a "normal store followed by a
> > full memory barrier" is good enough. But at least in *theory* a
> > microarchitecture might make it much cheaper to do a "store with
> > release consistency" followed by "write barrier".
> 
> I agree with the sentiment but, given that this stuff is so heavily
> microarchitecture-dependent (and not simple to probe), a simple dmb ish
> might be the best option after all. That's especially true if the
> microarchitecture decided to ignore the barrier options and treat everything
> as `all accesses, full system' in order to keep the hardware design simple.

I believe that we can do quite a bit better with current hardware
instructions (in the case of ARM, for a recent definition of "current")
and also simplify the memory ordering quite a bit.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> > 
> > 
> > /*
> >  * One important detail is that the kbuf part and the kbuf_writer() are
> >  * strictly per cpu and we can thus rely on program order for those.
> >  *
> >  * Only the userspace consumer can possibly run on another cpu, and thus we
> >  * need to ensure data consistency for those.
> >  */
> > 
> > struct buffer {
> > u64 size;
> > u64 tail;
> > u64 head;
> > void *data;
> > };
> > 
> > struct buffer *kbuf, *ubuf;
> > 
> > /*
> >  * If there's space in the buffer; store the data @buf; otherwise
> >  * discard it.
> >  */
> > void kbuf_write(int sz, void *buf)
> > {
> > u64 tail, head, offset;
> > 
> > do {
> > tail = ACCESS_ONCE(ubuf->tail);
> 
> So the above load is the key load.  It determines whether or not we
> have space in the buffer.  This of course assumes that only this CPU
> writes to ->head.

This assumption is true.

> If so, then:
> 
>   tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */
> 

OK, the way I understand ACQUIRE semantics are the semi-permeable LOCK
semantics from Documentation/memory-barriers.txt. In which case the
relevant STORES below could be hoisted up here, but not across the READ,
which I suppose is sufficient.

> > offset = head = kbuf->head;
> > if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
> > /* discard @buf */
> > return;
> > }
> > head += sz;
> > } while (local_cmpxchg(&kbuf->head, offset, head) != offset)
> 
> If there is an issue with kbuf->head, presumably local_cmpxchg() fails
> and we retry.
> 
> But sheesh, do you think we could have buried the definitions of
> local_cmpxchg() under a few more layers of macro expansion just to
> keep things more obscure?  Anyway, griping aside...
> 
> o __cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
>   doesn't seem to exclude NMIs, so is not safe for this usage.
> 
> o __cmpxchg_local() in ARM handles NMI as long as the
>   argument is 32 bits, otherwise, it uses the aforementioned
>   __cmpxchg_local_generic(), which does not handle NMI.  Given your
>   u64, this does not look good...
> 
>   And some ARM subarches (e.g., metag) seem to fail to handle NMI
>   even in the 32-bit case.
> 
> o FRV and M32r seem to act similar to ARM.
> 
> Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
> does not seem to be safe against NMIs in general.  :-/
> 
> That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Ah my bad, so the in-kernel kbuf variant uses unsigned long, which on
all archs should be the native words size and cover the address space.

Only the public variant (ubuf) is u64 wide to not change data structure
layout on compat etc.

I suppose this was a victim on the simplification :/

And in case of 32bit the upper word will always be zero and the partial
reads should all work out just fine.

> Of course, x86's local_cmpxchg() has full memory barriers implicitly.

Not quite, the 'lock' in __raw_cmpxchg() expands to "" due to
__cmpxchg_local(), etc.. 

> > 
> > /*
> >  * Ensure that if we see the userspace tail (ubuf->tail) such
> >  * that there is space to write @buf without overwriting data
> >  * userspace hasn't seen yet, we won't in fact store data before
> >  * that read completes.
> >  */
> > 
> > smp_mb(); /* A, matches with D */
> 
> Given a change to smp_load_with_acquire_semantics() above, you should not
> need this smp_mb().

Because the STORES can not be hoisted across the ACQUIRE, indeed.

> 
> > memcpy(kbuf->data + offset, buf, sz);
> > 
> > /*
> >  * Ensure that we write all the @buf data before we update the
> >  * userspace visible ubuf->head pointer.
> >  */
> > smp_wmb(); /* B, matches with C */
> > 
> > ubuf->head = kbuf->head;
> 
> Replace the smp_wmb() and the assignment with:
> 
>   smp_store_with_release_semantics(ubuf->head, kbuf->head); /* B -> C */

And here the RELEASE semantics I assume are the same as the
semi-permeable UNLOCK from _The_ document? In which case the above
STORES cannot be lowered across this store and all should, again, be
well.
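
Putting the two replacements together, the writer side would then look
something like this (sketch only, untested, using the proposed names):

void kbuf_write(int sz, void *buf)
{
	u64 tail, head, offset;

	do {
		/* was barrier A: later stores cannot be hoisted above this load */
		tail = smp_load_with_acquire_semantics(ubuf->tail);
		offset = head = kbuf->head;
		if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
			/* discard @buf */
			return;
		}
		head += sz;
	} while (local_cmpxchg(&kbuf->head, offset, head) != offset);

	memcpy(kbuf->data + offset, buf, sz);

	/* was barrier B: the data stores above cannot sink below this store */
	smp_store_with_release_semantics(ubuf->head, kbuf->head);
}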

> > }
> > 
> > /*
> >  * Consume the buffer data and update the tail pointer to indicate to
> >  * kernel space there's 'free' space.
> >  */
> > void ubuf_read(void)
> > {
> > u64 head, tail;
> > 
> > tail = ACCESS_ONCE(ubuf->tail);
> 
> Does anyone else write tail?  Or is this defense against NMIs?

No, we're the sole writer; just general paranoia. Not sure the actual
userspace does this; /me checks. Nope, distinct lack of ACCESS_ONCE()
there, just the rmb(), which including a barrier() should hopefully
accomplish similar things most of the time ;-)

I'll need to introduce ACCESS_ONCE to 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 12:22:54PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
> > OK, something like this for the definitions (though PowerPC might want
> > to locally abstract the lwsync expansion):
> > 
> > #define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
> > do { \
> > barrier(); \
> > ACCESS_ONCE(p) = (v); \
> > } while (0)
> > 
> > #define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
> > do { \
> > __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> > ACCESS_ONCE(p) = (v); \
> > } while (0)
> > 
> > #define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
> > ({ \
> > typeof(*p) *_p1 = ACCESS_ONCE(p); \
> > barrier(); \
> > _p1; \
> > })
> > 
> > #define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
> > ({ \
> > typeof(*p) *_p1 = ACCESS_ONCE(p); \
> > __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> > _p1; \
> > })
> > 
> > For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
> > "ldar" instruction and smp_store_with_release_semantics() is a wrapper
> > around the ARM "stlr" instruction.
> 
> This still leaves me confused as to what to do with my case :/
> 
> Slightly modified since last time -- as the simplified version was maybe
> simplified too far.
> 
> To recap, I'd like to get rid of barrier A where possible, since that's
> now a full barrier for every event written.
> 
> However, there's no immediate store I can attach it to; the obvious one
> would be the kbuf->head store, but that's complicated by the
> local_cmpxchg() thing.
> 
> And we need that cmpxchg loop because a hardware NMI event can
> interleave with a software event.
> 
> And to be honest, I'm still totally confused about memory barriers vs
> control flow vs C/C++. The only way we're ever getting to that memcpy is
> if we've already observed ubuf->tail, so that LOAD has to be fully
> processes and completed.
> 
> I'm really not seeing how a STORE from the memcpy() could possibly go
> wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
> then I suppose we should all just go home.
> 
> /me who wants A to be a barrier() but is terminally confused.

Well, let's see...

> ---
> 
> 
> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
> 
> struct buffer {
> u64 size;
> u64 tail;
> u64 head;
> void *data;
> };
> 
> struct buffer *kbuf, *ubuf;
> 
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>   u64 tail, head, offset;
> 
>   do {
>   tail = ACCESS_ONCE(ubuf->tail);

So the above load is the key load.  It determines whether or not we
have space in the buffer.  This of course assumes that only this CPU
writes to ->head.

If so, then:

tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */

>   offset = head = kbuf->head;
>   if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
>   /* discard @buf */
>   return;
>   }
>   head += sz;
>   } while (local_cmpxchg(&kbuf->head, offset, head) != offset)

If there is an issue with kbuf->head, presumably local_cmpxchg() fails
and we retry.

But sheesh, do you think we could have buried the definitions of
local_cmpxchg() under a few more layers of macro expansion just to
keep things more obscure?  Anyway, griping aside...

o   __cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
doesn't seem to exclude NMIs, so is not safe for this usage.

o   __cmpxchg_local() in ARM handles NMI as long as the
	argument is 32 bits, otherwise, it uses the aforementioned
__cmpxchg_local_generic(), which does not handle NMI.  Given your
u64, this does not look good...

And some ARM subarches (e.g., metag) seem to fail to handle NMI
even in the 32-bit case.

o   FRV and M32r seem to act similar to ARM.

Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
does not seem to be safe against NMIs in general.  :-/

That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Of course, x86's local_cmpxchg() has full memory barriers implicitly.

> 
> /*
>  * Ensure that if we see the userspace tail (ubuf->tail) such
>  * that there is space to write @buf without overwriting data
>  * userspace hasn't seen yet, we won't in fact store data before
>  * that 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
> OK, something like this for the definitions (though PowerPC might want
> to locally abstract the lwsync expansion):
> 
>   #define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
>   do { \
>   barrier(); \
>   ACCESS_ONCE(p) = (v); \
>   } while (0)
> 
>   #define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
>   do { \
>   __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
>   ACCESS_ONCE(p) = (v); \
>   } while (0)
> 
>   #define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
>   ({ \
>   typeof(*p) *_p1 = ACCESS_ONCE(p); \
>   barrier(); \
>   _p1; \
>   })
> 
>   #define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
>   ({ \
>   typeof(*p) *_p1 = ACCESS_ONCE(p); \
>   __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
>   _p1; \
>   })
> 
> For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
> "ldar" instruction and smp_store_with_release_semantics() is a wrapper
> around the ARM "stlr" instruction.

This still leaves me confused as to what to do with my case :/

Slightly modified since last time -- as the simplified version was maybe
simplified too far.

To recap, I'd like to get rid of barrier A where possible, since that's
now a full barrier for every event written.

However, there's no immediate store I can attach it to; the obvious one
would be the kbuf->head store, but that's complicated by the
local_cmpxchg() thing.

And we need that cmpxchg loop because a hardware NMI event can
interleave with a software event.

And to be honest, I'm still totally confused about memory barriers vs
control flow vs C/C++. The only way we're ever getting to that memcpy is
if we've already observed ubuf->tail, so that LOAD has to be fully
processed and completed.

I'm really not seeing how a STORE from the memcpy() could possibly go
wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
then I suppose we should all just go home.

/me who wants A to be a barrier() but is terminally confused.

---


/*
 * One important detail is that the kbuf part and the kbuf_writer() are
 * strictly per cpu and we can thus rely on program order for those.
 *
 * Only the userspace consumer can possibly run on another cpu, and thus we
 * need to ensure data consistency for those.
 */

struct buffer {
u64 size;
u64 tail;
u64 head;
void *data;
};

struct buffer *kbuf, *ubuf;

/*
 * If there's space in the buffer; store the data @buf; otherwise
 * discard it.
 */
void kbuf_write(int sz, void *buf)
{
u64 tail, head, offset;

do {
tail = ACCESS_ONCE(ubuf->tail);
offset = head = kbuf->head;
if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
/* discard @buf */
return;
}
head += sz;
	} while (local_cmpxchg(&kbuf->head, offset, head) != offset)

/*
 * Ensure that if we see the userspace tail (ubuf->tail) such
 * that there is space to write @buf without overwriting data
 * userspace hasn't seen yet, we won't in fact store data before
 * that read completes.
 */

smp_mb(); /* A, matches with D */

memcpy(kbuf->data + offset, buf, sz);

/*
 * Ensure that we write all the @buf data before we update the
 * userspace visible ubuf->head pointer.
 */
smp_wmb(); /* B, matches with C */

ubuf->head = kbuf->head;
}

/*
 * Consume the buffer data and update the tail pointer to indicate to
 * kernel space there's 'free' space.
 */
void ubuf_read(void)
{
u64 head, tail;

tail = ACCESS_ONCE(ubuf->tail);
head = ACCESS_ONCE(ubuf->head);

/*
 * Ensure we read the buffer boundaries before the actual buffer
 * data...
 */
smp_rmb(); /* C, matches with B */

while (tail != head) {
obj = ubuf->data + tail;
/* process obj */
tail += obj->size;
tail %= ubuf->size;
}

/*
 * Ensure all data reads are complete before we issue the
 * ubuf->tail update; once that update hits, kbuf_write() can
 * observe and overwrite data.
 */
smp_mb(); /* D, matches with A */

ubuf->tail = tail;
}
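
(For comparison, a sketch of how this pair of barriers on the consumer side
might collapse into the load-acquire/store-release primitives discussed
elsewhere in the thread; illustration only, not a posted patch:)

void ubuf_read(void)
{
	u64 head, tail;

	tail = ACCESS_ONCE(ubuf->tail);
	/* was C: the buffer data reads below cannot be hoisted above this */
	head = smp_load_with_acquire_semantics(ubuf->head);

	while (tail != head) {
		obj = ubuf->data + tail;
		/* process obj */
		tail += obj->size;
		tail %= ubuf->size;
	}

	/* was D: the data reads above cannot sink below this store */
	smp_store_with_release_semantics(ubuf->tail, tail);
}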
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Will Deacon
On Sun, Nov 03, 2013 at 11:34:00PM +, Linus Torvalds wrote:
> So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
> problem is that a "smp_rmb()" doesn't really "attach" to the preceding
> write.

Agreed.

> This is analogous to a "acquire" operation: you cannot make an
> "acquire" barrier, because it's not a barrier *between* two ops, it's
> associated with one particular op.
> 
> So what I *think* you actually really really want is a "store with
> release consistency, followed by a write barrier".

How does that order reads against reads? (Paul mentioned this as a
requirement). I'm not clear about the use case for this, so perhaps there is a
dependency that I'm not aware of.

> In TSO, afaik all stores have release consistency, and all writes are
> ordered, which is why this is a no-op in TSO. And x86 also has that
> "all stores have release consistency, and all writes are ordered"
> model, even if TSO doesn't really describe the x86 model.
> 
> But on ARM64, for example, I think you'd really want the store itself
> to be done with "stlr" (store with release), and then follow up with a
> "dsb st" after that.

So a dsb is pretty heavyweight here (it prevents execution of *any* further
instructions until all preceeding stores have completed, as well as
ensuring completion of any ongoing cache flushes). In conjunction with the
store-release, that's going to hold everything up until the store-release
(and therefore any preceeding memory accesses) have completed. Granted, I
think that gives Paul his read/read ordering, but it's a lot heavier than
what's required.

> And notice how that requires you to mark the store itself. There is no
> actual barrier *after* the store that does the optimized model.
> 
> Of course, it's entirely possible that it's not worth worrying about
> this on ARM64, and that just doing it as a "normal store followed by a
> full memory barrier" is good enough. But at least in *theory* a
> microarchitecture might make it much cheaper to do a "store with
> release consistency" followed by "write barrier".

I agree with the sentiment but, given that this stuff is so heavily
microarchitecture-dependent (and not simple to probe), a simple dmb ish
might be the best option after all. That's especially true if the
microarchitecture decided to ignore the barrier options and treat everything
as `all accesses, full system' in order to keep the hardware design simple.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Sun, Nov 03, 2013 at 03:34:00PM -0800, Linus Torvalds wrote:
> On Sun, Nov 3, 2013 at 2:42 PM, Paul E. McKenney
> paul...@linux.vnet.ibm.com wrote:
> >
> > smp_storebuffer_mb() -- A barrier that enforces those orderings
> > that do not invalidate the hardware store-buffer optimization.
> 
> Ugh. Maybe. Can you guarantee that those are the correct semantics?
> And why talk about the hardware semantics, when you really want
> specific semantics for the *software*.
> 
> > smp_not_w_r_mb() -- A barrier that orders everything except prior
> > writes against subsequent reads.
> 
> Ok, that sounds more along the lines of "these are the semantics we
> want", but I have to say, it also doesn't make me go "ahh, ok".
> 
> > smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
> > semantics.  (C/C++ "acquire" orders a specific load against
> > subsequent loads and stores, while C/C++ "release" orders
> > a specific store against prior loads and stores.)
> 
> I don't think this is true. acquire+release is much stronger than what
> you're looking for - it doesn't allow subsequent reads to move past
> the write (because that would violate the acquire part). On x86, for
> example, you'd need to have a locked cycle for smp_acqrel_mb().
> 
> So again, what are the guarantees you actually want? Describe those.
> And then make a name.

I was thinking in terms of the guarantee that TSO systems provide
given a barrier() directive, and that PowerPC provides given the lwsync
instruction.  This guarantee is that loads preceding the barrier will
not be reordered with memory referenced following the barrier, and that
stores preceding the barrier will not be reordered with stores following
the barrier.  But given how much easier RCU reviews became after burying
smp_wmb() and smp_read_barrier_depends() into rcu_assign_pointer() and
rcu_dereference(), respectively, I think I prefer an extension of your
idea below.
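
For concreteness, that guarantee permits exactly one reordering, the prior
store against the later load (illustration only; x and y are shared
variables, a and b are locals, and tso_mb() stands in for the TSO barrier()
/ PowerPC lwsync case just described):

	a = ACCESS_ONCE(x);	/* prior load: ordered before everything below */
	ACCESS_ONCE(y) = 1;	/* prior store: ordered before the later store */
	tso_mb();
	b = ACCESS_ONCE(x);	/* the only access allowed to pass the prior store */
	ACCESS_ONCE(y) = 2;	/* ordered after both prior accesses */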

> I _think_ the guarantees you want is:
>  - SMP write barrier
>  - *local* read barrier for reads preceding the write.
> 
> but the problem is that the "preceding reads" part is really
> specifically about the write that you had. The barrier should really
> be attached to the *particular* write operation, it cannot be a
> standalone barrier.

Indeed, neither rcu_assign_pointer() nor the circular queue really needs a
standalone barrier, so that attaching the barrier to a particular memory
reference would work.  And as you note below, in the case of ARM this
would turn into one of their new memory-reference instructions.

> So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
> problem is that a "smp_rmb()" doesn't really "attach" to the preceding
> write.
> 
> This is analogous to a "acquire" operation: you cannot make an
> "acquire" barrier, because it's not a barrier *between* two ops, it's
> associated with one particular op.

But you -could- use any barrier that prevented reordering of any preceding
load with any subsequent memory reference.  Please note that I am -not-
advocating this anymore, because I like the idea of attaching the barrier
to a particular memory operation.  However, for completeness, here it is
in the case of TSO systems and PowerPC, respectively:

#define smp_acquire_mb() barrier();

#define smp_acquire_mb() \
__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory");

This functions correctly, but is a pain to review because you have to
figure out which of many possible preceding loads the smp_acquire_mb()
is supposed to attach to.  As you say, it is -way- better to attach the
barrier to a particular memory operation.

> So what I *think* you actually really really want is a "store with
> release consistency, followed by a write barrier".

I believe that the combination of "store with release consistency" and
"load with acquire consistency" should do the trick for the two use cases
at this point, which again are circular buffers and rcu_assign_pointer().
At this point, I don't see the need for "followed by a write barrier".
But I step through the circular buffers below.

> In TSO, afaik all stores have release consistency, and all writes are
> ordered, which is why this is a no-op in TSO. And x86 also has that
> "all stores have release consistency, and all writes are ordered"
> model, even if TSO doesn't really describe the x86 model.

Yep, as does the mainframe.  And these architectures also have all reads
having acquire consistency.

> But on ARM64, for example, I think you'd really want the store itself
> to be done with "stlr" (store with release), and then follow up with a
> "dsb st" after that.

Agree with the "stlr" but don't (yet, anyway) understand the need for
a subsequent "dsb st".

> And notice how that requires you to mark the store itself. There is no
> actual barrier *after* the store that does the optimized model.

And marking the store itself is a very good thing from my viewpoint.

> Of course, it's entirely possible that 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 12:22:54PM +0100, Peter Zijlstra wrote:
 On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
  OK, something like this for the definitions (though PowerPC might want
  to locally abstract the lwsync expansion):
  
  #define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
  do { \
  barrier(); \
  ACCESS_ONCE(p) = (v); \
  } while (0)
  
  #define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
  do { \
  __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :memory); \
  ACCESS_ONCE(p) = (v); \
  } while (0)
  
  #define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
  ({ \
  typeof(*p) *_p1 = ACCESS_ONCE(p); \
  barrier(); \
  _p1; \
  })
  
  #define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
  ({ \
  typeof(*p) *_p1 = ACCESS_ONCE(p); \
  __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :memory); \
  _p1; \
  })
  
  For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
  ldar instruction and smp_store_with_release_semantics() is a wrapper
  around the ARM stlr instruction.
 
 This still leaves me confused as to what to do with my case :/
 
 Slightly modified since last time -- as the simplified version was maybe
 simplified too far.
 
 To recap, I'd like to get rid of barrier A where possible, since that's
 now a full barrier for every event written.
 
 However, there's no immediate store I can attach it to; the obvious one
 would be the kbuf-head store, but that's complicated by the
 local_cmpxchg() thing.
 
 And we need that cmpxchg loop because a hardware NMI event can
 interleave with a software event.
 
 And to be honest, I'm still totally confused about memory barriers vs
 control flow vs C/C++. The only way we're ever getting to that memcpy is
 if we've already observed ubuf-tail, so that LOAD has to be fully
 processes and completed.
 
 I'm really not seeing how a STORE from the memcpy() could possibly go
 wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
 then I suppose we should all just go home.
 
 /me who wants A to be a barrier() but is terminally confused.

Well, let's see...

 ---
 
 
 /*
  * One important detail is that the kbuf part and the kbuf_writer() are
  * strictly per cpu and we can thus rely on program order for those.
  *
  * Only the userspace consumer can possibly run on another cpu, and thus we
  * need to ensure data consistency for those.
  */
 
 struct buffer {
 u64 size;
 u64 tail;
 u64 head;
 void *data;
 };
 
 struct buffer *kbuf, *ubuf;
 
 /*
  * If there's space in the buffer; store the data @buf; otherwise
  * discard it.
  */
 void kbuf_write(int sz, void *buf)
 {
   u64 tail, head, offset;
 
   do {
   tail = ACCESS_ONCE(ubuf-tail);

So the above load is the key load.  It determines whether or not we
have space in the buffer.  This of course assumes that only this CPU
writes to -head.

If so, then:

tail = smp_load_with_acquire_semantics(ubuf-tail); /* A - D */

   offset = head = kbuf-head;
   if (CIRC_SPACE(head, tail, kbuf-size)  sz) {
   /* discard @buf */
   return;
   }
   head += sz;
   } while (local_cmpxchg(kbuf-head, offset, head) != offset)

If there is an issue with kbuf-head, presumably local_cmpxchg() fails
and we retry.

But sheesh, do you think we could have buried the definitions of
local_cmpxchg() under a few more layers of macro expansion just to
keep things more obscure?  Anyway, griping aside...

o   __cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
doesn't seem to exclude NMIs, so is not safe for this usage.

o   __cmpxchg_local() in ARM handles NMI as long as the
argument is 32 bits, otherwise, it uses the aforementionted
__cmpxchg_local_generic(), which does not handle NMI.  Given your
u64, this does not look good...

And some ARM subarches (e.g., metag) seem to fail to handle NMI
even in the 32-bit case.

o   FRV and M32r seem to act similar to ARM.

Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
does not seem to be safe against NMIs in general.  :-/

That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Of course, x86's local_cmpxchg() has full memory barriers implicitly.
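
(For reference, the asm-generic fallback is roughly the pattern below; a
simplified sketch, not the verbatim kernel source.  The irq disable keeps
normal interrupts out of the read-modify-write, but NMIs are unaffected,
which is the problem.)

static inline unsigned long
__cmpxchg_local_sketch(volatile unsigned long *ptr,
                       unsigned long old, unsigned long new)
{
        unsigned long flags, prev;

        raw_local_irq_save(flags);      /* holds off IRQs, not NMIs */
        prev = *ptr;
        /*
         * An NMI arriving here can run its own cmpxchg and update *ptr;
         * the store below then silently clobbers that update.
         */
        if (prev == old)
                *ptr = new;
        raw_local_irq_restore(flags);

        return prev;
}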

 
 /*
  * Ensure that if we see the userspace tail (ubuf->tail) such
  * that there is space to write @buf without overwriting data
  * userspace hasn't seen yet, we won't in fact store data before
  * that read completes.
  */
 
 smp_mb(); /* A, matches with D */

Given a change to smp_load_with_acquire_semantics() above, you should not
need this smp_mb().

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
  
  
  /*
   * One important detail is that the kbuf part and the kbuf_writer() are
   * strictly per cpu and we can thus rely on program order for those.
   *
   * Only the userspace consumer can possibly run on another cpu, and thus we
   * need to ensure data consistency for those.
   */
  
  struct buffer {
  u64 size;
  u64 tail;
  u64 head;
  void *data;
  };
  
  struct buffer *kbuf, *ubuf;
  
  /*
   * If there's space in the buffer; store the data @buf; otherwise
   * discard it.
   */
  void kbuf_write(int sz, void *buf)
  {
  u64 tail, head, offset;
  
  do {
  tail = ACCESS_ONCE(ubuf->tail);
 
 So the above load is the key load.  It determines whether or not we
 have space in the buffer.  This of course assumes that only this CPU
 writes to ->head.

This assumption is true.

 If so, then:
 
   tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */
 

OK, the way I understand ACQUIRE semantics are the semi-permeable LOCK
semantics from Documentation/memory-barriers.txt. In which case the
relevant STORES below could be hoisted up here, but not across the READ,
which I suppose is sufficient.

  offset = head = kbuf->head;
  if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
  /* discard @buf */
  return;
  }
  head += sz;
  } while (local_cmpxchg(&kbuf->head, offset, head) != offset)
 
 If there is an issue with kbuf->head, presumably local_cmpxchg() fails
 and we retry.
 
 But sheesh, do you think we could have buried the definitions of
 local_cmpxchg() under a few more layers of macro expansion just to
 keep things more obscure?  Anyway, griping aside...
 
 o __cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
   doesn't seem to exclude NMIs, so is not safe for this usage.
 
 o __cmpxchg_local() in ARM handles NMI as long as the
   argument is 32 bits, otherwise, it uses the aforementioned
   __cmpxchg_local_generic(), which does not handle NMI.  Given your
   u64, this does not look good...
 
   And some ARM subarches (e.g., metag) seem to fail to handle NMI
   even in the 32-bit case.
 
 o FRV and M32r seem to act similar to ARM.
 
 Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
 does not seem to be safe against NMIs in general.  :-/
 
 That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Ah my bad, so the in-kernel kbuf variant uses unsigned long, which on
all archs should be the native word size and cover the address space.

Only the public variant (ubuf) is u64 wide to not change data structure
layout on compat etc.

I suppose this was a victim of the simplification :/

And in case of 32bit the upper word will always be zero and the partial
reads should all work out just fine.
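
In other words, roughly the following split (field names invented for this
sketch; the real thing is the kernel-side ring buffer plus the mmap()ed,
user-visible control page):

struct kbuf {                           /* kernel private, strictly per cpu */
        unsigned long   size;
        unsigned long   head;           /* native word, so local_cmpxchg() works */
        void            *data;
};

struct ubuf {                           /* shared with userspace via mmap() */
        __u64           size;
        __u64           head;           /* fixed width: 32-bit and 64-bit layouts agree */
        __u64           tail;
};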

 Of course, x86's local_cmpxchg() has full memory barriers implicitly.

Not quite, the 'lock' in __raw_cmpxchg() expands to "" due to
__cmpxchg_local(), etc..

  
  /*
   * Ensure that if we see the userspace tail (ubuf->tail) such
   * that there is space to write @buf without overwriting data
   * userspace hasn't seen yet, we won't in fact store data before
   * that read completes.
   */
  
  smp_mb(); /* A, matches with D */
 
 Given a change to smp_load_with_acquire_semantics() above, you should not
 need this smp_mb().

Because the STORES can not be hoisted across the ACQUIRE, indeed.

 
  memcpy(kbuf->data + offset, buf, sz);
  
  /*
   * Ensure that we write all the @buf data before we update the
   * userspace visible ubuf->head pointer.
   */
  smp_wmb(); /* B, matches with C */
  
  ubuf->head = kbuf->head;
 
 Replace the smp_wmb() and the assignment with:
 
   smp_store_with_release_semantics(ubuf->head, kbuf->head); /* B -> C */

And here the RELEASE semantics I assume are the same as the
semi-permeable UNLOCK from _The_ document? In which case the above
STORES cannot be lowered across this store and all should, again, be
well.

  }
  
  /*
   * Consume the buffer data and update the tail pointer to indicate to
   * kernel space there's 'free' space.
   */
  void ubuf_read(void)
  {
  u64 head, tail;
  
  tail = ACCESS_ONCE(ubuf->tail);
 
 Does anyone else write tail?  Or is this defense against NMIs?

No, we're the sole writer; just general paranoia. Not sure the actual
userspace does this; /me checks. Nope, distinct lack of ACCESS_ONCE()
there, just the rmb(), which, including a barrier(), should hopefully
accomplish similar things most of the time ;-)

I'll need to introduce ACCESS_ONCE to the userspace code.
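
Presumably just the kernel's definition carried over, i.e. something like:

/* keeps the compiler from refetching or caching the value */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))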

 If no one else writes to tail and if NMIs cannot muck things up, then
 the above ACCESS_ONCE() is not needed, though I would not object to its
 staying.
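
Putting the two suggested replacements together, the pair would end up
looking roughly like this; a sketch only, using the shorter
smp_load_acquire()/smp_store_release() names proposed elsewhere in this
thread, and guessing that C on the read side becomes an acquire load as
well, while D stays a full barrier for now:

void kbuf_write(int sz, void *buf)
{
        u64 tail, head, offset;

        do {
                tail = smp_load_acquire(ubuf->tail);    /* A, pairs with D */
                offset = head = kbuf->head;
                if (CIRC_SPACE(head, tail, kbuf->size) < sz)
                        return; /* discard @buf */
                head += sz;
        } while (local_cmpxchg(&kbuf->head, offset, head) != offset);

        memcpy(kbuf->data + offset, buf, sz);

        /* B, pairs with C: all of @buf is visible before the new head */
        smp_store_release(ubuf->head, kbuf->head);
}

void ubuf_read(void)
{
        u64 head, tail;

        tail = ACCESS_ONCE(ubuf->tail);
        head = smp_load_acquire(ubuf->head);    /* C, pairs with B */

        while (tail != head) {
                obj = ubuf->data + tail;
                /* process obj */
                tail += obj->size;
                tail %= ubuf->size;
        }

        smp_mb();       /* D, pairs with A */
        ubuf->tail = tail;
}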
 
  

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 11:05:53AM +, Will Deacon wrote:
 On Sun, Nov 03, 2013 at 11:34:00PM +, Linus Torvalds wrote:
  So it would *kind* of act like a smp_wmb() + smp_rmb(), but the
  problem is that a smp_rmb() doesn't really attach to the preceding
  write.
 
 Agreed.
 
  This is analogous to a acquire operation: you cannot make an
  acquire barrier, because it's not a barrier *between* two ops, it's
  associated with one particular op.
  
  So what I *think* you actually really really want is a store with
  release consistency, followed by a write barrier.
 
 How does that order reads against reads? (Paul mentioned this as a
requirement). I'm not clear about the use case for this, so perhaps there is a
 dependency that I'm not aware of.

An smp_store_with_release_semantics() orders against prior reads -and-
writes.  It maps to barrier() for x86, stlr for ARM, and lwsync for
PowerPC, as called out in my prototype definitions.
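
A sketch of what such arm64 wrappers could look like (64-bit accesses only;
the instruction forms and asm constraints here are illustrative assumptions,
not tested arch code):

#define smp_load_with_acquire_semantics(p)                              \
({                                                                      \
        typeof(p) ___v;                                                 \
        __asm__ __volatile__("ldar %0, %1"                              \
                             : "=r" (___v) : "Q" (p) : "memory");       \
        ___v;                                                           \
})

#define smp_store_with_release_semantics(p, v)                          \
do {                                                                    \
        __asm__ __volatile__("stlr %1, %0"                              \
                             : "=Q" (p) : "r" (v) : "memory");          \
} while (0)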

  In TSO, afaik all stores have release consistency, and all writes are
  ordered, which is why this is a no-op in TSO. And x86 also has that
  all stores have release consistency, and all writes are ordered
  model, even if TSO doesn't really describe the x86 model.
  
  But on ARM64, for example, I think you'd really want the store itself
  to be done with stlr (store with release), and then follow up with a
  dsb st after that.
 
 So a dsb is pretty heavyweight here (it prevents execution of *any* further
instructions until all preceding stores have completed, as well as
 ensuring completion of any ongoing cache flushes). In conjunction with the
 store-release, that's going to hold everything up until the store-release
(and therefore any preceding memory accesses) have completed. Granted, I
 think that gives Paul his read/read ordering, but it's a lot heavier than
 what's required.

I do not believe that we need the trailing dsb st.

  And notice how that requires you to mark the store itself. There is no
  actual barrier *after* the store that does the optimized model.
  
  Of course, it's entirely possible that it's not worth worrying about
  this on ARM64, and that just doing it as a normal store followed by a
  full memory barrier is good enough. But at least in *theory* a
  microarchitecture might make it much cheaper to do a store with
  release consistency followed by write barrier.
 
 I agree with the sentiment but, given that this stuff is so heavily
 microarchitecture-dependent (and not simple to probe), a simple dmb ish
 might be the best option after all. That's especially true if the
 microarchitecture decided to ignore the barrier options and treat everything
 as `all accesses, full system' in order to keep the hardware design simple.

I believe that we can do quite a bit better with current hardware
instructions (in the case of ARM, for a recent definition of current)
and also simplify the memory ordering quite a bit.

Thanx, Paul



Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
 All this is leading me to suggest the following shortenings of names:
 
   smp_load_with_acquire_semantics() -> smp_load_acquire()
 
   smp_store_with_release_semantics() -> smp_store_release()
 
 But names aside, the above gets rid of explicit barriers on TSO architectures,
 allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
 the heavier-weight sync.

A little something like this? Completely guessed at the arm/arm64/ia64
asm, but at least for those archs I found proper instructions (I hope),
for x86,sparc,s390 which are TSO we can do with a barrier and PPC like
said can do with the lwsync, all others fall back to using a smp_mb().

Should probably come with a proper changelog and an addition to _The_
document.

---
 arch/alpha/include/asm/barrier.h  | 13 +++
 arch/arc/include/asm/barrier.h| 13 +++
 arch/arm/include/asm/barrier.h| 26 +
 arch/arm64/include/asm/barrier.h  | 28 +++
 arch/avr32/include/asm/barrier.h  | 12 ++
 arch/blackfin/include/asm/barrier.h   | 13 +++
 arch/cris/include/asm/barrier.h   | 13 +++
 arch/frv/include/asm/barrier.h| 13 +++
 arch/h8300/include/asm/barrier.h  | 13 +++
 arch/hexagon/include/asm/barrier.h| 13 +++
 arch/ia64/include/asm/barrier.h   | 43 +++
 arch/m32r/include/asm/barrier.h   | 13 +++
 arch/m68k/include/asm/barrier.h   | 13 +++
 arch/metag/include/asm/barrier.h  | 13 +++
 arch/microblaze/include/asm/barrier.h | 13 +++
 arch/mips/include/asm/barrier.h   | 13 +++
 arch/mn10300/include/asm/barrier.h| 13 +++
 arch/parisc/include/asm/barrier.h | 13 +++
 arch/powerpc/include/asm/barrier.h| 15 
 arch/s390/include/asm/barrier.h   | 13 +++
 arch/score/include/asm/barrier.h  | 13 +++
 arch/sh/include/asm/barrier.h | 13 +++
 arch/sparc/include/asm/barrier_32.h   | 13 +++
 arch/sparc/include/asm/barrier_64.h   | 13 +++
 arch/tile/include/asm/barrier.h   | 13 +++
 arch/unicore32/include/asm/barrier.h  | 13 +++
 arch/x86/include/asm/barrier.h| 13 +++
 arch/xtensa/include/asm/barrier.h | 13 +++
 28 files changed, 423 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..464139feee97 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends() do { } while (0)
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..a779da846fb5 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,19 @@
 #define smp_wmb()   barrier()
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define smp_mb__before_atomic_dec()barrier()
 #define smp_mb__after_atomic_dec() barrier()
 #define smp_mb__before_atomic_inc()barrier()
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..a804093d6891 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,10 +53,36 @@
 #define smp_mb()   barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  barrier()
+

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
 +#define smp_load_acquire(p, v)   
 \

I R idiot!! :-)

---
 arch/alpha/include/asm/barrier.h  | 13 +++
 arch/arc/include/asm/barrier.h| 13 +++
 arch/arm/include/asm/barrier.h| 26 +
 arch/arm64/include/asm/barrier.h  | 28 +++
 arch/avr32/include/asm/barrier.h  | 12 ++
 arch/blackfin/include/asm/barrier.h   | 13 +++
 arch/cris/include/asm/barrier.h   | 13 +++
 arch/frv/include/asm/barrier.h| 13 +++
 arch/h8300/include/asm/barrier.h  | 13 +++
 arch/hexagon/include/asm/barrier.h| 13 +++
 arch/ia64/include/asm/barrier.h   | 43 +++
 arch/m32r/include/asm/barrier.h   | 13 +++
 arch/m68k/include/asm/barrier.h   | 13 +++
 arch/metag/include/asm/barrier.h  | 13 +++
 arch/microblaze/include/asm/barrier.h | 13 +++
 arch/mips/include/asm/barrier.h   | 13 +++
 arch/mn10300/include/asm/barrier.h| 13 +++
 arch/parisc/include/asm/barrier.h | 13 +++
 arch/powerpc/include/asm/barrier.h| 15 
 arch/s390/include/asm/barrier.h   | 13 +++
 arch/score/include/asm/barrier.h  | 13 +++
 arch/sh/include/asm/barrier.h | 13 +++
 arch/sparc/include/asm/barrier_32.h   | 13 +++
 arch/sparc/include/asm/barrier_64.h   | 13 +++
 arch/tile/include/asm/barrier.h   | 13 +++
 arch/unicore32/include/asm/barrier.h  | 13 +++
 arch/x86/include/asm/barrier.h| 13 +++
 arch/xtensa/include/asm/barrier.h | 13 +++
 28 files changed, 423 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..464139feee97 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends() do { } while (0)
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..a779da846fb5 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,19 @@
 #define smp_wmb()   barrier()
 #endif
 
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1;   \
+} while (0)
+
 #define smp_mb__before_atomic_dec()barrier()
 #define smp_mb__after_atomic_dec() barrier()
 #define smp_mb__before_atomic_inc()barrier()
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..4ada4720bdeb 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,10 +53,36 @@
 #define smp_mb()   barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  barrier()
+
+#define smp_store_release(p, v)
\
+do {   \
+   smp_mb();   \
+   ACCESS_ONCE(p) = (v);   \
+} while (0)
+
+#define smp_load_acquire(p)\
+do {   \
+   typeof(p) ___p1 = ACCESS_ONCE(p);   \
+   smp_mb();   \
+   return ___p1; 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
 On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
  All this is leading me to suggest the following shortenings of names:
  
  smp_load_with_acquire_semantics() -> smp_load_acquire()
  
  smp_store_with_release_semantics() -> smp_store_release()
  
  But names aside, the above gets rid of explicit barriers on TSO 
  architectures,
  allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
  the heavier-weight sync.
 
 A little something like this? Completely guessed at the arm/arm64/ia64
 asm, but at least for those archs I found proper instructions (I hope),
 for x86,sparc,s390 which are TSO we can do with a barrier and PPC like
 said can do with the lwsync, all others fall back to using a smp_mb().
 
 Should probably come with a proper changelog and an addition to _The_
 document.

Maybe something like this for the changelog?

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specified store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively, resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.
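
As a concrete illustration of the rcu_assign_pointer() point, the publish
side could then be written directly in terms of the new primitive; a
sketch only, ignoring the type checking and sparse annotations the real
macro carries:

#define rcu_assign_pointer(p, v) \
        smp_store_release((p), (v))

The release orders prior reads as well as prior writes against the
pointer publication, which is exactly the ordering the smp_wmb()-based
version cannot guarantee.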

Some comments below.  I believe that opcodes need to be fixed for IA64.
I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
be able to tell us.

Other than that, for the rest:

Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com

 ---
  arch/alpha/include/asm/barrier.h  | 13 +++
  arch/arc/include/asm/barrier.h| 13 +++
  arch/arm/include/asm/barrier.h| 26 +
  arch/arm64/include/asm/barrier.h  | 28 +++
  arch/avr32/include/asm/barrier.h  | 12 ++
  arch/blackfin/include/asm/barrier.h   | 13 +++
  arch/cris/include/asm/barrier.h   | 13 +++
  arch/frv/include/asm/barrier.h| 13 +++
  arch/h8300/include/asm/barrier.h  | 13 +++
  arch/hexagon/include/asm/barrier.h| 13 +++
  arch/ia64/include/asm/barrier.h   | 43 
 +++
  arch/m32r/include/asm/barrier.h   | 13 +++
  arch/m68k/include/asm/barrier.h   | 13 +++
  arch/metag/include/asm/barrier.h  | 13 +++
  arch/microblaze/include/asm/barrier.h | 13 +++
  arch/mips/include/asm/barrier.h   | 13 +++
  arch/mn10300/include/asm/barrier.h| 13 +++
  arch/parisc/include/asm/barrier.h | 13 +++
  arch/powerpc/include/asm/barrier.h| 15 
  arch/s390/include/asm/barrier.h   | 13 +++
  arch/score/include/asm/barrier.h  | 13 +++
  arch/sh/include/asm/barrier.h | 13 +++
  arch/sparc/include/asm/barrier_32.h   | 13 +++
  arch/sparc/include/asm/barrier_64.h   | 13 +++
  arch/tile/include/asm/barrier.h   | 13 +++
  arch/unicore32/include/asm/barrier.h  | 13 +++
  arch/x86/include/asm/barrier.h| 13 +++
  arch/xtensa/include/asm/barrier.h | 13 +++
  28 files changed, 423 insertions(+)
 
 diff --git a/arch/alpha/include/asm/barrier.h 
 b/arch/alpha/include/asm/barrier.h
 index ce8860a0b32d..464139feee97 100644
 --- a/arch/alpha/include/asm/barrier.h
 +++ b/arch/alpha/include/asm/barrier.h
 @@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
  #define smp_read_barrier_depends()   do { } while (0)
  #endif
 
 +#define smp_store_release(p, v)  
 \
 +do { \
 + smp_mb();   \
 + ACCESS_ONCE(p) = (v);   \
 +} while (0)
 +
 +#define 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 08:18:11PM +0100, Peter Zijlstra wrote:
 On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
  +#define smp_load_acquire(p, v) 
  \
 
 I R idiot!! :-)

OK, I did miss this one as well...  :-/

Thanx, Paul

 ---
  arch/alpha/include/asm/barrier.h  | 13 +++
  arch/arc/include/asm/barrier.h| 13 +++
  arch/arm/include/asm/barrier.h| 26 +
  arch/arm64/include/asm/barrier.h  | 28 +++
  arch/avr32/include/asm/barrier.h  | 12 ++
  arch/blackfin/include/asm/barrier.h   | 13 +++
  arch/cris/include/asm/barrier.h   | 13 +++
  arch/frv/include/asm/barrier.h| 13 +++
  arch/h8300/include/asm/barrier.h  | 13 +++
  arch/hexagon/include/asm/barrier.h| 13 +++
  arch/ia64/include/asm/barrier.h   | 43 
 +++
  arch/m32r/include/asm/barrier.h   | 13 +++
  arch/m68k/include/asm/barrier.h   | 13 +++
  arch/metag/include/asm/barrier.h  | 13 +++
  arch/microblaze/include/asm/barrier.h | 13 +++
  arch/mips/include/asm/barrier.h   | 13 +++
  arch/mn10300/include/asm/barrier.h| 13 +++
  arch/parisc/include/asm/barrier.h | 13 +++
  arch/powerpc/include/asm/barrier.h| 15 
  arch/s390/include/asm/barrier.h   | 13 +++
  arch/score/include/asm/barrier.h  | 13 +++
  arch/sh/include/asm/barrier.h | 13 +++
  arch/sparc/include/asm/barrier_32.h   | 13 +++
  arch/sparc/include/asm/barrier_64.h   | 13 +++
  arch/tile/include/asm/barrier.h   | 13 +++
  arch/unicore32/include/asm/barrier.h  | 13 +++
  arch/x86/include/asm/barrier.h| 13 +++
  arch/xtensa/include/asm/barrier.h | 13 +++
  28 files changed, 423 insertions(+)
 
 diff --git a/arch/alpha/include/asm/barrier.h 
 b/arch/alpha/include/asm/barrier.h
 index ce8860a0b32d..464139feee97 100644
 --- a/arch/alpha/include/asm/barrier.h
 +++ b/arch/alpha/include/asm/barrier.h
 @@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
  #define smp_read_barrier_depends()   do { } while (0)
  #endif
 
 +#define smp_store_release(p, v)  
 \
 +do { \
 + smp_mb();   \
 + ACCESS_ONCE(p) = (v);   \
 +} while (0)
 +
 +#define smp_load_acquire(p)  \
 +do { \
 + typeof(p) ___p1 = ACCESS_ONCE(p);   \
 + smp_mb();   \
 + return ___p1;   \
 +} while (0)
 +
  #define set_mb(var, value) \
  do { var = value; mb(); } while (0)
 
 diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
 index f6cb7c4ffb35..a779da846fb5 100644
 --- a/arch/arc/include/asm/barrier.h
 +++ b/arch/arc/include/asm/barrier.h
 @@ -30,6 +30,19 @@
  #define smp_wmb()   barrier()
  #endif
 
 +#define smp_store_release(p, v)  
 \
 +do { \
 + smp_mb();   \
 + ACCESS_ONCE(p) = (v);   \
 +} while (0)
 +
 +#define smp_load_acquire(p)  \
 +do { \
 + typeof(p) ___p1 = ACCESS_ONCE(p);   \
 + smp_mb();   \
 + return ___p1;   \
 +} while (0)
 +
  #define smp_mb__before_atomic_dec()  barrier()
  #define smp_mb__after_atomic_dec()   barrier()
  #define smp_mb__before_atomic_inc()  barrier()
 diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
 index 60f15e274e6d..4ada4720bdeb 100644
 --- a/arch/arm/include/asm/barrier.h
 +++ b/arch/arm/include/asm/barrier.h
 @@ -53,10 +53,36 @@
  #define smp_mb() barrier()
  #define smp_rmb()barrier()
  #define smp_wmb()barrier()
 +
 +#define smp_store_release(p, v)  
 \
 +do { \
 + smp_mb();   \
 + ACCESS_ONCE(p) = (v);   \
 +} while (0)
 +
 +#define smp_load_acquire(p)  \
 +do {  

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Paul E. McKenney
On Sun, Nov 03, 2013 at 03:34:00PM -0800, Linus Torvalds wrote:
 On Sun, Nov 3, 2013 at 2:42 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  smp_storebuffer_mb() -- A barrier that enforces those orderings
  that do not invalidate the hardware store-buffer optimization.
 
 Ugh. Maybe. Can you guarantee that those are the correct semantics?
 And why talk about the hardware semantics, when you really want
 specific semantics for the *software*.
 
  smp_not_w_r_mb() -- A barrier that orders everything except prior
  writes against subsequent reads.
 
 Ok, that sounds more along the lines of these are the semantics we
 want, but I have to say, it also doesn't make me go ahh, ok.
 
  smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
  semantics.  (C/C++ acquire orders a specific load against
  subsequent loads and stores, while C/C++ release orders
  a specific store against prior loads and stores.)
 
 I don't think this is true. acquire+release is much stronger than what
 you're looking for - it doesn't allow subsequent reads to move past
 the write (because that would violate the acquire part). On x86, for
 example, you'd need to have a locked cycle for smp_acqrel_mb().
 
 So again, what are the guarantees you actually want? Describe those.
 And then make a name.

I was thinking in terms of the guarantee that TSO systems provide
given a barrier() directive, and that PowerPC provides given the lwsync
instruction.  This guarantee is that loads preceding the barrier will
not be reordered with memory referenced following the barrier, and that
stores preceding the barrier will not be reordered with stores following
the barrier.  But given how much easier RCU reviews became after burying
smp_wmb() and smp_read_barrier_depends() into rcu_assign_pointer() and
rcu_dereference(), respectively, I think I prefer an extension of your
idea below.

 I _think_ the guarantees you want is:
  - SMP write barrier
  - *local* read barrier for reads preceding the write.
 
 but the problem is that the preceding reads part is really
 specifically about the write that you had. The barrier should really
 be attached to the *particular* write operation, it cannot be a
 standalone barrier.

Indeed, neither rcu_assign_pointer() nor the circular queue really needs a
standalone barrier, so that attaching the barrier to a particular memory
reference would work.  And as you note below, in the case of ARM this
would turn into one of their new memory-reference instructions.

 So it would *kind* of act like a smp_wmb() + smp_rmb(), but the
 problem is that a smp_rmb() doesn't really attach to the preceding
 write.
 
 This is analogous to a acquire operation: you cannot make an
 acquire barrier, because it's not a barrier *between* two ops, it's
 associated with one particular op.

But you -could- use any barrier that prevented reordering of any preceding
load with any subsequent memory reference.  Please note that I am -not-
advocating this anymore, because I like the idea of attaching the barrier
to a particular memory operation.  However, for completeness, here it is
in the case of TSO systems and PowerPC, respectively:

#define smp_acquire_mb() barrier();

#define smp_acquire_mb() \
__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :memory);

This functions correctly, but is a pain to review because you have to
figure out which of many possible preceding loads the smp_acquire_mb()
is supposed to attach to.  As you say, it is -way- better to attach the
barrier to a particular memory operation.

 So what I *think* you actually really really want is a store with
 release consistency, followed by a write barrier.

I believe that the combination of store with release consistency and
load with acquire consistency should do the trick for the two use cases
at this point, which again are circular buffers and rcu_assign_pointer().
At this point, I don't see the need for followed by a write barrier.
But I step through the circular buffers below.

 In TSO, afaik all stores have release consistency, and all writes are
 ordered, which is why this is a no-op in TSO. And x86 also has that
 all stores have release consistency, and all writes are ordered
 model, even if TSO doesn't really describe the x86 model.

Yep, as does the mainframe.  And these architectures also have all reads
having acquire consistency.

 But on ARM64, for example, I think you'd really want the store itself
 to be done with stlr (store with release), and then follow up with a
 dsb st after that.

Agree with the stlr but don't (yet, anyway) understand the need for
a subsequent dsb st.

 And notice how that requires you to mark the store itself. There is no
 actual barrier *after* the store that does the optimized model.

And marking the store itself is a very good thing from my viewpoint.

 Of course, it's entirely possible that it's not worth worrying about
 this on ARM64, and that just doing it as a normal 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Will Deacon
On Sun, Nov 03, 2013 at 11:34:00PM +, Linus Torvalds wrote:
 So it would *kind* of act like a smp_wmb() + smp_rmb(), but the
 problem is that a smp_rmb() doesn't really attach to the preceding
 write.

Agreed.

 This is analogous to a acquire operation: you cannot make an
 acquire barrier, because it's not a barrier *between* two ops, it's
 associated with one particular op.
 
 So what I *think* you actually really really want is a store with
 release consistency, followed by a write barrier.

How does that order reads against reads? (Paul mentioned this as a
requirement). I'm not clear about the use case for this, so perhaps there is a
dependency that I'm not aware of.

 In TSO, afaik all stores have release consistency, and all writes are
 ordered, which is why this is a no-op in TSO. And x86 also has that
 all stores have release consistency, and all writes are ordered
 model, even if TSO doesn't really describe the x86 model.
 
 But on ARM64, for example, I think you'd really want the store itself
 to be done with stlr (store with release), and then follow up with a
 dsb st after that.

So a dsb is pretty heavyweight here (it prevents execution of *any* further
instructions until all preceding stores have completed, as well as
ensuring completion of any ongoing cache flushes). In conjunction with the
store-release, that's going to hold everything up until the store-release
(and therefore any preceding memory accesses) have completed. Granted, I
think that gives Paul his read/read ordering, but it's a lot heavier than
what's required.

 And notice how that requires you to mark the store itself. There is no
 actual barrier *after* the store that does the optimized model.
 
 Of course, it's entirely possible that it's not worth worrying about
 this on ARM64, and that just doing it as a normal store followed by a
 full memory barrier is good enough. But at least in *theory* a
 microarchitecture might make it much cheaper to do a store with
 release consistency followed by write barrier.

I agree with the sentiment but, given that this stuff is so heavily
microarchitecture-dependent (and not simple to probe), a simple dmb ish
might be the best option after all. That's especially true if the
microarchitecture decided to ignore the barrier options and treat everything
as `all accesses, full system' in order to keep the hardware design simple.

Will


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-04 Thread Peter Zijlstra
On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
 OK, something like this for the definitions (though PowerPC might want
 to locally abstract the lwsync expansion):
 
   #define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
   do { \
   barrier(); \
   ACCESS_ONCE(p) = (v); \
   } while (0)
 
   #define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
   do { \
   __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
   ACCESS_ONCE(p) = (v); \
   } while (0)
 
   #define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
   ({ \
   typeof(*p) *_p1 = ACCESS_ONCE(p); \
   barrier(); \
   _p1; \
   })
 
   #define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
   ({ \
   typeof(*p) *_p1 = ACCESS_ONCE(p); \
   __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
   _p1; \
   })
 
 For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
 ldar instruction and smp_store_with_release_semantics() is a wrapper
 around the ARM stlr instruction.

This still leaves me confused as to what to do with my case :/

Slightly modified since last time -- as the simplified version was maybe
simplified too far.

To recap, I'd like to get rid of barrier A where possible, since that's
now a full barrier for every event written.

However, there's no immediate store I can attach it to; the obvious one
would be the kbuf->head store, but that's complicated by the
local_cmpxchg() thing.

And we need that cmpxchg loop because a hardware NMI event can
interleave with a software event.

And to be honest, I'm still totally confused about memory barriers vs
control flow vs C/C++. The only way we're ever getting to that memcpy is
if we've already observed ubuf->tail, so that LOAD has to be fully
processed and completed.

I'm really not seeing how a STORE from the memcpy() could possibly go
wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
then I suppose we should all just go home.

/me who wants A to be a barrier() but is terminally confused.

---


/*
 * One important detail is that the kbuf part and the kbuf_writer() are
 * strictly per cpu and we can thus rely on program order for those.
 *
 * Only the userspace consumer can possibly run on another cpu, and thus we
 * need to ensure data consistency for those.
 */

struct buffer {
u64 size;
u64 tail;
u64 head;
void *data;
};

struct buffer *kbuf, *ubuf;

/*
 * If there's space in the buffer; store the data @buf; otherwise
 * discard it.
 */
void kbuf_write(int sz, void *buf)
{
u64 tail, head, offset;

do {
tail = ACCESS_ONCE(ubuf->tail);
offset = head = kbuf->head;
if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
/* discard @buf */
return;
}
head += sz;
} while (local_cmpxchg(&kbuf->head, offset, head) != offset)

/*
 * Ensure that if we see the userspace tail (ubuf->tail) such
 * that there is space to write @buf without overwriting data
 * userspace hasn't seen yet, we won't in fact store data before
 * that read completes.
 */

smp_mb(); /* A, matches with D */

memcpy(kbuf->data + offset, buf, sz);

/*
 * Ensure that we write all the @buf data before we update the
 * userspace visible ubuf->head pointer.
 */
smp_wmb(); /* B, matches with C */

ubuf->head = kbuf->head;
}

/*
 * Consume the buffer data and update the tail pointer to indicate to
 * kernel space there's 'free' space.
 */
void ubuf_read(void)
{
u64 head, tail;

tail = ACCESS_ONCE(ubuf->tail);
head = ACCESS_ONCE(ubuf->head);

/*
 * Ensure we read the buffer boundaries before the actual buffer
 * data...
 */
smp_rmb(); /* C, matches with B */

while (tail != head) {
obj = ubuf->data + tail;
/* process obj */
tail += obj->size;
tail %= ubuf->size;
}

/*
 * Ensure all data reads are complete before we issue the
 * ubuf->tail update; once that update hits, kbuf_write() can
 * observe and overwrite data.
 */
smp_mb(); /* D, matches with A */

ubuf->tail = tail;
}


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Linus Torvalds
On Sun, Nov 3, 2013 at 2:42 PM, Paul E. McKenney
 wrote:
>
> smp_storebuffer_mb() -- A barrier that enforces those orderings
> that do not invalidate the hardware store-buffer optimization.

Ugh. Maybe. Can you guarantee that those are the correct semantics?
And why talk about the hardware semantics, when you really want
specific semantics for the *software*.

> smp_not_w_r_mb() -- A barrier that orders everything except prior
> writes against subsequent reads.

Ok, that sounds more along the lines of "these are the semantics we
want", but I have to say, it also doesn't make me go "ahh, ok".

> smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
> semantics.  (C/C++ "acquire" orders a specific load against
> subsequent loads and stores, while C/C++ "release" orders
> a specific store against prior loads and stores.)

I don't think this is true. acquire+release is much stronger than what
you're looking for - it doesn't allow subsequent reads to move past
the write (because that would violate the acquire part). On x86, for
example, you'd need to have a locked cycle for smp_acqrel_mb().

So again, what are the guarantees you actually want? Describe those.
And then make a name.

I _think_ the guarantees you want is:
 - SMP write barrier
 - *local* read barrier for reads preceding the write.

but the problem is that the "preceding reads" part is really
specifically about the write that you had. The barrier should really
be attached to the *particular* write operation, it cannot be a
standalone barrier.

So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
problem is that a "smp_rmb()" doesn't really "attach" to the preceding
write.

This is analogous to a "acquire" operation: you cannot make an
"acquire" barrier, because it's not a barrier *between* two ops, it's
associated with one particular op.

So what I *think* you actually really really want is a "store with
release consistency, followed by a write barrier".

In TSO, afaik all stores have release consistency, and all writes are
ordered, which is why this is a no-op in TSO. And x86 also has that
"all stores have release consistency, and all writes are ordered"
model, even if TSO doesn't really describe the x86 model.

But on ARM64, for example, I think you'd really want the store itself
to be done with "stlr" (store with release), and then follow up with a
"dsb st" after that.

And notice how that requires you to mark the store itself. There is no
actual barrier *after* the store that does the optimized model.

Of course, it's entirely possible that it's not worth worrying about
this on ARM64, and that just doing it as a "normal store followed by a
full memory barrier" is good enough. But at least in *theory* a
microarchitecture might make it much cheaper to do a "store with
release consistency" followed by "write barrier".

Anyway, having talked exhaustively about exactly what semantics you
are after, I *think* the best model would be to just have a

  #define smp_store_with_release_semantics(x, y) ...

and use that *and* a "smp_wmb()" for this (possibly a special
"smp_wmb_after_release()" if that allows people to avoid double
barriers). On x86 (and TSO systems), the
smp_store_with_release_semantics() would be just a regular store, and
the smp_wmb() is obviously a no-op. Other platforms would end up doing
other things.

Hmm?

 Linus


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 07:59:23AM +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2013-11-03 at 16:17 +0100, Peter Zijlstra wrote:
> > On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> > > If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> > 
> > Well, I'm obviously all for introducing this new barrier, for it will
> > reduce a full mfence on x86 to a compiler barrier. And ppc can use
> > lwsync as opposed to sync afaict. Not sure ARM can do better.
> 
> The patch at the *very least* needs a good description of the semantics
> of the barrier, what does it order vs. what etc...

Agreed.  Also it needs a name that people can live with.  We will get
there.  ;-)

Thanx, Paul

> Cheers,
> Ben.
> 
> > ---
> > Subject: arch: Introduce new TSO memory barrier smp_tmb()
> > 
> > A few sites could be downgraded from smp_mb() to smp_tmb() and a few
> > sites should be upgraded to smp_tmb() that are now using smp_wmb().
> > 
> > XXX hope PaulMck explains things better..
> > 
> > X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
> > reduces to barrier().
> > 
> > PPC can use lwsync instead of sync
> > 
> > For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
> > is always correct but possibly suboptimal.
> > 
> > Suggested-by: Paul McKenney 
> > Not-Signed-off-by: Peter Zijlstra 
> > ---
> >  arch/alpha/include/asm/barrier.h  | 2 ++
> >  arch/arc/include/asm/barrier.h| 2 ++
> >  arch/arm/include/asm/barrier.h| 2 ++
> >  arch/arm64/include/asm/barrier.h  | 2 ++
> >  arch/avr32/include/asm/barrier.h  | 1 +
> >  arch/blackfin/include/asm/barrier.h   | 1 +
> >  arch/cris/include/asm/barrier.h   | 2 ++
> >  arch/frv/include/asm/barrier.h| 1 +
> >  arch/h8300/include/asm/barrier.h  | 2 ++
> >  arch/hexagon/include/asm/barrier.h| 1 +
> >  arch/ia64/include/asm/barrier.h   | 2 ++
> >  arch/m32r/include/asm/barrier.h   | 2 ++
> >  arch/m68k/include/asm/barrier.h   | 1 +
> >  arch/metag/include/asm/barrier.h  | 3 +++
> >  arch/microblaze/include/asm/barrier.h | 1 +
> >  arch/mips/include/asm/barrier.h   | 3 +++
> >  arch/mn10300/include/asm/barrier.h| 2 ++
> >  arch/parisc/include/asm/barrier.h | 1 +
> >  arch/powerpc/include/asm/barrier.h| 2 ++
> >  arch/s390/include/asm/barrier.h   | 1 +
> >  arch/score/include/asm/barrier.h  | 1 +
> >  arch/sh/include/asm/barrier.h | 2 ++
> >  arch/sparc/include/asm/barrier_32.h   | 1 +
> >  arch/sparc/include/asm/barrier_64.h   | 3 +++
> >  arch/tile/include/asm/barrier.h   | 2 ++
> >  arch/unicore32/include/asm/barrier.h  | 1 +
> >  arch/x86/include/asm/barrier.h| 3 +++
> >  arch/xtensa/include/asm/barrier.h | 1 +
> >  28 files changed, 48 insertions(+)
> > 
> > diff --git a/arch/alpha/include/asm/barrier.h 
> > b/arch/alpha/include/asm/barrier.h
> > index ce8860a0b32d..02ea63897038 100644
> > --- a/arch/alpha/include/asm/barrier.h
> > +++ b/arch/alpha/include/asm/barrier.h
> > @@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
> >  #ifdef CONFIG_SMP
> >  #define __ASM_SMP_MB   "\tmb\n"
> >  #define smp_mb()   mb()
> > +#define smp_tmb()  mb()
> >  #define smp_rmb()  rmb()
> >  #define smp_wmb()  wmb()
> >  #define smp_read_barrier_depends() read_barrier_depends()
> >  #else
> >  #define __ASM_SMP_MB
> >  #define smp_mb()   barrier()
> > +#define smp_tmb()  barrier()
> >  #define smp_rmb()  barrier()
> >  #define smp_wmb()  barrier()
> >  #define smp_read_barrier_depends() do { } while (0)
> > diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> > index f6cb7c4ffb35..456c790fa1ad 100644
> > --- a/arch/arc/include/asm/barrier.h
> > +++ b/arch/arc/include/asm/barrier.h
> > @@ -22,10 +22,12 @@
> >  /* TODO-vineetg verify the correctness of macros here */
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()mb()
> > +#define smp_tmb()  mb()
> >  #define smp_rmb()   rmb()
> >  #define smp_wmb()   wmb()
> >  #else
> >  #define smp_mb()barrier()
> > +#define smp_tmb()  barrier()
> >  #define smp_rmb()   barrier()
> >  #define smp_wmb()   barrier()
> >  #endif
> > diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> > index 60f15e274e6d..bc88a8505673 100644
> > --- a/arch/arm/include/asm/barrier.h
> > +++ b/arch/arm/include/asm/barrier.h
> > @@ -51,10 +51,12 @@
> >  
> >  #ifndef CONFIG_SMP
> >  #define smp_mb()   barrier()
> > +#define smp_tmb()  barrier()
> >  #define smp_rmb()  barrier()
> >  #define smp_wmb()  barrier()
> >  #else
> >  #define smp_mb()   dmb(ish)
> > +#define smp_tmb()  smp_mb()
> >  #define smp_rmb()  smp_mb()
> >  #define smp_wmb()  dmb(ishst)
> >  #endif
> > diff --git a/arch/arm64/include/asm/barrier.h 
> > b/arch/arm64/include/asm/barrier.h
> > index d4a63338a53c..ec0531f4892f 100644
> > --- a/arch/arm64/include/asm/barrier.h
> > +++ 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Paul E. McKenney
On Sun, Nov 03, 2013 at 09:01:24PM +0100, Peter Zijlstra wrote:
> On Sun, Nov 03, 2013 at 10:08:14AM -0800, Linus Torvalds wrote:
> > On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra  wrote:
> > > On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> > >> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> > >
> > > Well, I'm obviously all for introducing this new barrier, for it will
> > > reduce a full mfence on x86 to a compiler barrier. And ppc can use
> > > lwsync as opposed to sync afaict. Not sure ARM can do better.
> > >
> > > ---
> > > Subject: arch: Introduce new TSO memory barrier smp_tmb()
> > 
> > This is specialized enough that I would *really* like the name to be
> > more descriptive. Compare to the special "smp_read_barrier_depends()"
> > maco: it's unusual, and it has very specific semantics, so it gets a
> > long and descriptive name.
> > 
> > Memory ordering is subtle enough without then using names that are
> > subtle in themselves. mb/rmb/wmb are conceptually pretty simple
> > operations, and very basic when talking about memory ordering.
> > "acquire" and "release" are less simple, but have descriptive names
> > and have very specific uses in locking.
> > 
> > In contrast "smp_tmb()" is a *horrible* name, because TSO is a
> > description of the memory ordering, not of a particular barrier. It's
> > also not even clear that you can have a "tso barrier", since the
> > ordering (like acquire/release) presumably is really about one
> > particular *store*, not about some kind of barrier between different
> > operations.
> > 
> > So please describe exactly what the semantics that barrier has, and
> > then name the barrier that way.
> > 
> > I assume that in this particular case, the semantics RCU wants is
> > "write barrier, and no preceding reads can move past this point".

Its semantics order prior reads against subsequent reads, prior reads
against subsequent writes, and prior writes against subsequent writes.
It does -not- order prior writes against subsequent reads.
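
A store-buffering litmus sketch of the one reordering that remains allowed
(x == y == 0 initially):

        CPU 0                           CPU 1
        -----                           -----
        ACCESS_ONCE(x) = 1;             ACCESS_ONCE(y) = 1;
        smp_tmb();                      smp_tmb();
        r0 = ACCESS_ONCE(y);            r1 = ACCESS_ONCE(x);

The outcome r0 == 0 && r1 == 0 is still permitted, because the prior
write may pass the subsequent read; smp_mb() on both sides would forbid it.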

> > Calling that "smp_tmb()" is f*cking insane, imnsho.
> 
> Fair enough; from what I could gather the proposed semantics are
> RELEASE+WMB, such that neither reads not writes can cross over, writes
> can't cross back, but reads could.
> 
> Since both RELEASE and WMB are trivial under TSO the entire thing
> collapses.

And here are some candidate names, with no attempt to sort sanity from
insanity:

smp_storebuffer_mb() -- A barrier that enforces those orderings
that do not invalidate the hardware store-buffer optimization.

smp_not_w_r_mb() -- A barrier that orders everything except prior
writes against subsequent reads.

smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
semantics.  (C/C++ "acquire" orders a specific load against
subsequent loads and stores, while C/C++ "release" orders
a specific store against prior loads and stores.)

Others?

> Now I'm currently completely confused as to what C/C++ wrecks vs actual
> proper memory order issues; let alone fully comprehend the case that
> started all this.

Each can result in similar wreckage.  In either case, it is about failing
to guarantee needed orderings.

Thanx, Paul



Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Benjamin Herrenschmidt
On Sun, 2013-11-03 at 16:17 +0100, Peter Zijlstra wrote:
> On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> > If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> 
> Well, I'm obviously all for introducing this new barrier, for it will
> reduce a full mfence on x86 to a compiler barrier. And ppc can use
> lwsync as opposed to sync afaict. Not sure ARM can do better.

The patch at the *very least* needs a good description of the semantics
of the barrier, what does it order vs. what etc...

Cheers,
Ben.

> ---
> Subject: arch: Introduce new TSO memory barrier smp_tmb()
> 
> A few sites could be downgraded from smp_mb() to smp_tmb() and a few
> sites should be upgraded to smp_tmb() that are now using smp_wmb().
> 
> XXX hope PaulMck explains things better..
> 
> X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
> reduces to barrier().
> 
> PPC can use lwsync instead of sync
> 
> For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
> is always correct but possibly suboptimal.
> 
> Suggested-by: Paul McKenney 
> Not-Signed-off-by: Peter Zijlstra 
> ---
>  arch/alpha/include/asm/barrier.h  | 2 ++
>  arch/arc/include/asm/barrier.h| 2 ++
>  arch/arm/include/asm/barrier.h| 2 ++
>  arch/arm64/include/asm/barrier.h  | 2 ++
>  arch/avr32/include/asm/barrier.h  | 1 +
>  arch/blackfin/include/asm/barrier.h   | 1 +
>  arch/cris/include/asm/barrier.h   | 2 ++
>  arch/frv/include/asm/barrier.h| 1 +
>  arch/h8300/include/asm/barrier.h  | 2 ++
>  arch/hexagon/include/asm/barrier.h| 1 +
>  arch/ia64/include/asm/barrier.h   | 2 ++
>  arch/m32r/include/asm/barrier.h   | 2 ++
>  arch/m68k/include/asm/barrier.h   | 1 +
>  arch/metag/include/asm/barrier.h  | 3 +++
>  arch/microblaze/include/asm/barrier.h | 1 +
>  arch/mips/include/asm/barrier.h   | 3 +++
>  arch/mn10300/include/asm/barrier.h| 2 ++
>  arch/parisc/include/asm/barrier.h | 1 +
>  arch/powerpc/include/asm/barrier.h| 2 ++
>  arch/s390/include/asm/barrier.h   | 1 +
>  arch/score/include/asm/barrier.h  | 1 +
>  arch/sh/include/asm/barrier.h | 2 ++
>  arch/sparc/include/asm/barrier_32.h   | 1 +
>  arch/sparc/include/asm/barrier_64.h   | 3 +++
>  arch/tile/include/asm/barrier.h   | 2 ++
>  arch/unicore32/include/asm/barrier.h  | 1 +
>  arch/x86/include/asm/barrier.h| 3 +++
>  arch/xtensa/include/asm/barrier.h | 1 +
>  28 files changed, 48 insertions(+)
> 
> diff --git a/arch/alpha/include/asm/barrier.h 
> b/arch/alpha/include/asm/barrier.h
> index ce8860a0b32d..02ea63897038 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
>  #ifdef CONFIG_SMP
>  #define __ASM_SMP_MB "\tmb\n"
>  #define smp_mb() mb()
> +#define smp_tmb()mb()
>  #define smp_rmb()rmb()
>  #define smp_wmb()wmb()
>  #define smp_read_barrier_depends()   read_barrier_depends()
>  #else
>  #define __ASM_SMP_MB
>  #define smp_mb() barrier()
> +#define smp_tmb()barrier()
>  #define smp_rmb()barrier()
>  #define smp_wmb()barrier()
>  #define smp_read_barrier_depends()   do { } while (0)
> diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> index f6cb7c4ffb35..456c790fa1ad 100644
> --- a/arch/arc/include/asm/barrier.h
> +++ b/arch/arc/include/asm/barrier.h
> @@ -22,10 +22,12 @@
>  /* TODO-vineetg verify the correctness of macros here */
>  #ifdef CONFIG_SMP
>  #define smp_mb()mb()
> +#define smp_tmb()mb()
>  #define smp_rmb()   rmb()
>  #define smp_wmb()   wmb()
>  #else
>  #define smp_mb()barrier()
> +#define smp_tmb()barrier()
>  #define smp_rmb()   barrier()
>  #define smp_wmb()   barrier()
>  #endif
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 60f15e274e6d..bc88a8505673 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -51,10 +51,12 @@
>  
>  #ifndef CONFIG_SMP
>  #define smp_mb() barrier()
> +#define smp_tmb()barrier()
>  #define smp_rmb()barrier()
>  #define smp_wmb()barrier()
>  #else
>  #define smp_mb() dmb(ish)
> +#define smp_tmb()smp_mb()
>  #define smp_rmb()smp_mb()
>  #define smp_wmb()dmb(ishst)
>  #endif
> diff --git a/arch/arm64/include/asm/barrier.h 
> b/arch/arm64/include/asm/barrier.h
> index d4a63338a53c..ec0531f4892f 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -33,10 +33,12 @@
>  
>  #ifndef CONFIG_SMP
>  #define smp_mb() barrier()
> +#define smp_tmb()barrier()
>  #define smp_rmb()barrier()
>  #define smp_wmb()barrier()
>  #else
>  #define smp_mb() asm volatile("dmb ish" : : : "memory")
> +#define smp_tmb()asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Peter Zijlstra
On Sun, Nov 03, 2013 at 10:08:14AM -0800, Linus Torvalds wrote:
> On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra  wrote:
> > On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> >> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> >
> > Well, I'm obviously all for introducing this new barrier, for it will
> > reduce a full mfence on x86 to a compiler barrier. And ppc can use
> > lwsync as opposed to sync afaict. Not sure ARM can do better.
> >
> > ---
> > Subject: arch: Introduce new TSO memory barrier smp_tmb()
> 
> This is specialized enough that I would *really* like the name to be
> more descriptive. Compare to the special "smp_read_barrier_depends()"
> maco: it's unusual, and it has very specific semantics, so it gets a
> long and descriptive name.
> 
> Memory ordering is subtle enough without then using names that are
> subtle in themselves. mb/rmb/wmb are conceptually pretty simple
> operations, and very basic when talking about memory ordering.
> "acquire" and "release" are less simple, but have descriptive names
> and have very specific uses in locking.
> 
> In contrast "smp_tmb()" is a *horrible* name, because TSO is a
> description of the memory ordering, not of a particular barrier. It's
> also not even clear that you can have a "tso barrier", since the
> ordering (like acquire/release) presumably is really about one
> particular *store*, not about some kind of barrier between different
> operations.
> 
> So please describe exactly what the semantics that barrier has, and
> then name the barrier that way.
> 
> I assume that in this particular case, the semantics RCU wants is
> "write barrier, and no preceding reads can move past this point".
> 
> Calling that "smp_tmb()" is f*cking insane, imnsho.

Fair enough; from what I could gather the proposed semantics are
RELEASE+WMB, such that neither reads nor writes can cross over, writes
can't cross back, but reads could.

Since both RELEASE and WMB are trivial under TSO the entire thing
collapses.

Now I'm currently completely confused as to what C/C++ wrecks vs actual
proper memory order issues; let alone fully comprehend the case that
started all this.


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Linus Torvalds
On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra  wrote:
> On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
>> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
>
> Well, I'm obviously all for introducing this new barrier, for it will
> reduce a full mfence on x86 to a compiler barrier. And ppc can use
> lwsync as opposed to sync afaict. Not sure ARM can do better.
>
> ---
> Subject: arch: Introduce new TSO memory barrier smp_tmb()

This is specialized enough that I would *really* like the name to be
more descriptive. Compare to the special "smp_read_barrier_depends()"
macro: it's unusual, and it has very specific semantics, so it gets a
long and descriptive name.

Memory ordering is subtle enough without then using names that are
subtle in themselves. mb/rmb/wmb are conceptually pretty simple
operations, and very basic when talking about memory ordering.
"acquire" and "release" are less simple, but have descriptive names
and have very specific uses in locking.

In contrast "smp_tmb()" is a *horrible* name, because TSO is a
description of the memory ordering, not of a particular barrier. It's
also not even clear that you can have a "tso barrier", since the
ordering (like acquire/release) presumably is really about one
particular *store*, not about some kind of barrier between different
operations.

So please describe exactly what the semantics that barrier has, and
then name the barrier that way.

I assume that in this particular case, the semantics RCU wants is
"write barrier, and no preceding reads can move past this point".

Calling that "smp_tmb()" is f*cking insane, imnsho.

  Linus


[RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Peter Zijlstra
On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().

Well, I'm obviously all for introducing this new barrier, for it will
reduce a full mfence on x86 to a compiler barrier. And ppc can use
lwsync as opposed to sync afaict. Not sure ARM can do better.

---
Subject: arch: Introduce new TSO memory barrier smp_tmb()

A few sites could be downgraded from smp_mb() to smp_tmb(), and a few
sites that are now using smp_wmb() should be upgraded to smp_tmb().

XXX hope PaulMck explains things better..

X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
reduces to barrier().

PPC can use lwsync instead of sync

For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
is always correct but possibly suboptimal.

Suggested-by: Paul McKenney 
Not-Signed-off-by: Peter Zijlstra 
---
 arch/alpha/include/asm/barrier.h  | 2 ++
 arch/arc/include/asm/barrier.h| 2 ++
 arch/arm/include/asm/barrier.h| 2 ++
 arch/arm64/include/asm/barrier.h  | 2 ++
 arch/avr32/include/asm/barrier.h  | 1 +
 arch/blackfin/include/asm/barrier.h   | 1 +
 arch/cris/include/asm/barrier.h   | 2 ++
 arch/frv/include/asm/barrier.h| 1 +
 arch/h8300/include/asm/barrier.h  | 2 ++
 arch/hexagon/include/asm/barrier.h| 1 +
 arch/ia64/include/asm/barrier.h   | 2 ++
 arch/m32r/include/asm/barrier.h   | 2 ++
 arch/m68k/include/asm/barrier.h   | 1 +
 arch/metag/include/asm/barrier.h  | 3 +++
 arch/microblaze/include/asm/barrier.h | 1 +
 arch/mips/include/asm/barrier.h   | 3 +++
 arch/mn10300/include/asm/barrier.h| 2 ++
 arch/parisc/include/asm/barrier.h | 1 +
 arch/powerpc/include/asm/barrier.h| 2 ++
 arch/s390/include/asm/barrier.h   | 1 +
 arch/score/include/asm/barrier.h  | 1 +
 arch/sh/include/asm/barrier.h | 2 ++
 arch/sparc/include/asm/barrier_32.h   | 1 +
 arch/sparc/include/asm/barrier_64.h   | 3 +++
 arch/tile/include/asm/barrier.h   | 2 ++
 arch/unicore32/include/asm/barrier.h  | 1 +
 arch/x86/include/asm/barrier.h| 3 +++
 arch/xtensa/include/asm/barrier.h | 1 +
 28 files changed, 48 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..02ea63897038 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
 #ifdef CONFIG_SMP
 #define __ASM_SMP_MB   "\tmb\n"
 #define smp_mb()   mb()
+#define smp_tmb()  mb()
 #define smp_rmb()  rmb()
 #define smp_wmb()  wmb()
 #define smp_read_barrier_depends() read_barrier_depends()
 #else
 #define __ASM_SMP_MB
 #define smp_mb()   barrier()
+#define smp_tmb()  barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  barrier()
 #define smp_read_barrier_depends() do { } while (0)
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..456c790fa1ad 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -22,10 +22,12 @@
 /* TODO-vineetg verify the correctness of macros here */
 #ifdef CONFIG_SMP
 #define smp_mb()mb()
+#define smp_tmb()  mb()
 #define smp_rmb()   rmb()
 #define smp_wmb()   wmb()
 #else
 #define smp_mb()barrier()
+#define smp_tmb()  barrier()
 #define smp_rmb()   barrier()
 #define smp_wmb()   barrier()
 #endif
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..bc88a8505673 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -51,10 +51,12 @@
 
 #ifndef CONFIG_SMP
 #define smp_mb()   barrier()
+#define smp_tmb()  barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  barrier()
 #else
 #define smp_mb()   dmb(ish)
+#define smp_tmb()  smp_mb()
 #define smp_rmb()  smp_mb()
 #define smp_wmb()  dmb(ishst)
 #endif
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index d4a63338a53c..ec0531f4892f 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -33,10 +33,12 @@
 
 #ifndef CONFIG_SMP
 #define smp_mb()   barrier()
+#define smp_tmb()  barrier()
 #define smp_rmb()  barrier()
 #define smp_wmb()  barrier()
 #else
 #define smp_mb()   asm volatile("dmb ish" : : : "memory")
+#define smp_tmb()  asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()  asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()  asm volatile("dmb ishst" : : : "memory")
 #endif
diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
index 0961275373db..6c6ccb9cf290 100644
--- a/arch/avr32/include/asm/barrier.h
+++ b/arch/avr32/include/asm/barrier.h
@@ -20,6 +20,7 @@
 # error "The AVR32 port does not support SMP"
 #else
 # define smp_mb()  barrier()
+# define smp_tmb()   
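
[The diff is truncated by the archive at this point; going by the
changelog above, the remaining hunks of interest would presumably boil
down to something like the following sketch, reconstructed from the
description rather than taken from the posted patch:]

/* powerpc (sketch): lwsync orders everything except prior stores
 * against later loads, i.e. the proposed smp_tmb() semantics. */
#define smp_tmb()	__asm__ __volatile__ ("lwsync" : : : "memory")

/* sparc64 and x86 without CONFIG_X86_OOSTORE (sketch): the hardware
 * is already TSO, so only the compiler needs constraining. */
#define smp_tmb()	barrier()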


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Benjamin Herrenschmidt
On Sun, 2013-11-03 at 16:17 +0100, Peter Zijlstra wrote:
 On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
  If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
 
 Well, I'm obviously all for introducing this new barrier, for it will
 reduce a full mfence on x86 to a compiler barrier. And ppc can use
 lwsync as opposed to sync afaict. Not sure ARM can do better.

The patch at the *very least* needs a good description of the semantics
of the barrier, what does it order vs. what etc...

Cheers,
Ben.

 ---
 Subject: arch: Introduce new TSO memory barrier smp_tmb()
 
 A few sites could be downgraded from smp_mb() to smp_tmb(), and a few
 sites that are now using smp_wmb() should be upgraded to smp_tmb().
 
 XXX hope PaulMck explains things better..
 
 X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
 reduces to barrier().
 
 PPC can use lwsync instead of sync
 
 For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
 is always correct but possibly suboptimal.
 
 Suggested-by: Paul McKenney paul...@linux.vnet.ibm.com
 Not-Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  arch/alpha/include/asm/barrier.h  | 2 ++
  arch/arc/include/asm/barrier.h| 2 ++
  arch/arm/include/asm/barrier.h| 2 ++
  arch/arm64/include/asm/barrier.h  | 2 ++
  arch/avr32/include/asm/barrier.h  | 1 +
  arch/blackfin/include/asm/barrier.h   | 1 +
  arch/cris/include/asm/barrier.h   | 2 ++
  arch/frv/include/asm/barrier.h| 1 +
  arch/h8300/include/asm/barrier.h  | 2 ++
  arch/hexagon/include/asm/barrier.h| 1 +
  arch/ia64/include/asm/barrier.h   | 2 ++
  arch/m32r/include/asm/barrier.h   | 2 ++
  arch/m68k/include/asm/barrier.h   | 1 +
  arch/metag/include/asm/barrier.h  | 3 +++
  arch/microblaze/include/asm/barrier.h | 1 +
  arch/mips/include/asm/barrier.h   | 3 +++
  arch/mn10300/include/asm/barrier.h| 2 ++
  arch/parisc/include/asm/barrier.h | 1 +
  arch/powerpc/include/asm/barrier.h| 2 ++
  arch/s390/include/asm/barrier.h   | 1 +
  arch/score/include/asm/barrier.h  | 1 +
  arch/sh/include/asm/barrier.h | 2 ++
  arch/sparc/include/asm/barrier_32.h   | 1 +
  arch/sparc/include/asm/barrier_64.h   | 3 +++
  arch/tile/include/asm/barrier.h   | 2 ++
  arch/unicore32/include/asm/barrier.h  | 1 +
  arch/x86/include/asm/barrier.h| 3 +++
  arch/xtensa/include/asm/barrier.h | 1 +
  28 files changed, 48 insertions(+)
 
 diff --git a/arch/alpha/include/asm/barrier.h 
 b/arch/alpha/include/asm/barrier.h
 index ce8860a0b32d..02ea63897038 100644
 --- a/arch/alpha/include/asm/barrier.h
 +++ b/arch/alpha/include/asm/barrier.h
 @@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
  #ifdef CONFIG_SMP
  #define __ASM_SMP_MB "\tmb\n"
  #define smp_mb() mb()
 +#define smp_tmb()mb()
  #define smp_rmb()rmb()
  #define smp_wmb()wmb()
  #define smp_read_barrier_depends()   read_barrier_depends()
  #else
  #define __ASM_SMP_MB
  #define smp_mb() barrier()
 +#define smp_tmb()barrier()
  #define smp_rmb()barrier()
  #define smp_wmb()barrier()
  #define smp_read_barrier_depends()   do { } while (0)
 diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
 index f6cb7c4ffb35..456c790fa1ad 100644
 --- a/arch/arc/include/asm/barrier.h
 +++ b/arch/arc/include/asm/barrier.h
 @@ -22,10 +22,12 @@
  /* TODO-vineetg verify the correctness of macros here */
  #ifdef CONFIG_SMP
  #define smp_mb()mb()
 +#define smp_tmb()mb()
  #define smp_rmb()   rmb()
  #define smp_wmb()   wmb()
  #else
  #define smp_mb()barrier()
 +#define smp_tmb()barrier()
  #define smp_rmb()   barrier()
  #define smp_wmb()   barrier()
  #endif
 diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
 index 60f15e274e6d..bc88a8505673 100644
 --- a/arch/arm/include/asm/barrier.h
 +++ b/arch/arm/include/asm/barrier.h
 @@ -51,10 +51,12 @@
  
  #ifndef CONFIG_SMP
  #define smp_mb() barrier()
 +#define smp_tmb()barrier()
  #define smp_rmb()barrier()
  #define smp_wmb()barrier()
  #else
  #define smp_mb() dmb(ish)
 +#define smp_tmb()smp_mb()
  #define smp_rmb()smp_mb()
  #define smp_wmb()dmb(ishst)
  #endif
 diff --git a/arch/arm64/include/asm/barrier.h 
 b/arch/arm64/include/asm/barrier.h
 index d4a63338a53c..ec0531f4892f 100644
 --- a/arch/arm64/include/asm/barrier.h
 +++ b/arch/arm64/include/asm/barrier.h
 @@ -33,10 +33,12 @@
  
  #ifndef CONFIG_SMP
  #define smp_mb() barrier()
 +#define smp_tmb()barrier()
  #define smp_rmb()barrier()
  #define smp_wmb()barrier()
  #else
  #define smp_mb() asm volatile("dmb ish" : : : "memory")
 +#define smp_tmb()  asm volatile("dmb ish" : : : "memory")
  #define smp_rmb()  asm volatile("dmb ishld" : : : "memory")
  #define smp_wmb()  asm volatile("dmb ishst" : : : "memory")
  #endif
 diff --git a/arch/avr32/include/asm/barrier.h 
 

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Paul E. McKenney
On Sun, Nov 03, 2013 at 09:01:24PM +0100, Peter Zijlstra wrote:
 On Sun, Nov 03, 2013 at 10:08:14AM -0800, Linus Torvalds wrote:
  On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra pet...@infradead.org wrote:
   On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
   If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
  
   Well, I'm obviously all for introducing this new barrier, for it will
   reduce a full mfence on x86 to a compiler barrier. And ppc can use
   lwsync as opposed to sync afaict. Not sure ARM can do better.
  
   ---
   Subject: arch: Introduce new TSO memory barrier smp_tmb()
  
  This is specialized enough that I would *really* like the name to be
 more descriptive. Compare to the special "smp_read_barrier_depends()"
 macro: it's unusual, and it has very specific semantics, so it gets a
  long and descriptive name.
  
  Memory ordering is subtle enough without then using names that are
  subtle in themselves. mb/rmb/wmb are conceptually pretty simple
  operations, and very basic when talking about memory ordering.
  acquire and release are less simple, but have descriptive names
  and have very specific uses in locking.
  
  In contrast smp_tmb() is a *horrible* name, because TSO is a
  description of the memory ordering, not of a particular barrier. It's
  also not even clear that you can have a tso barrier, since the
  ordering (like acquire/release) presumably is really about one
  particular *store*, not about some kind of barrier between different
  operations.
  
  So please describe exactly what the semantics that barrier has, and
  then name the barrier that way.
  
  I assume that in this particular case, the semantics RCU wants is
  "write barrier, and no preceding reads can move past this point".

Its semantics order prior reads against subsequent reads, prior reads
against subsequent writes, and prior writes against subsequent writes.
It does -not- order prior writes against subsequent reads.
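
In store-buffering litmus form (just a sketch, with smp_tmb() standing
in for whatever name we end up with):

int x, y;
int r0, r1;

void cpu0(void)
{
	ACCESS_ONCE(x) = 1;
	smp_tmb();		/* orders R->R, R->W and W->W ... */
	r0 = ACCESS_ONCE(y);	/* ... but not this W->R pair     */
}

void cpu1(void)
{
	ACCESS_ONCE(y) = 1;
	smp_tmb();
	r1 = ACCESS_ONCE(x);
}

/* r0 == 0 && r1 == 0 remains a legal outcome; smp_mb() would forbid it. */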

  Calling that smp_tmb() is f*cking insane, imnsho.
 
 Fair enough; from what I could gather the proposed semantics are
 RELEASE+WMB, such that neither reads nor writes can cross over, writes
 can't cross back, but reads could.
 
 Since both RELEASE and WMB are trivial under TSO the entire thing
 collapses.

And here are some candidate names, with no attempt to sort sanity from
insanity:

smp_storebuffer_mb() -- A barrier that enforces those orderings
that do not invalidate the hardware store-buffer optimization.

smp_not_w_r_mb() -- A barrier that orders everything except prior
writes against subsequent reads.

smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
semantics.  (C/C++ acquire orders a specific load against
subsequent loads and stores, while C/C++ release orders
a specific store against prior loads and stores.)

Others?

 Now I'm currently completely confused as to what C/C++ wrecks vs actual
 proper memory order issues; let alone fully comprehend the case that
 started all this.

Each can result in similar wreckage.  In either case, it is about failing
to guarantee needed orderings.
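
A tiny sketch of the compiler-side wreckage, for contrast (hypothetical
wait loops, nothing from the patch):

static int done;	/* set to 1 by some other CPU */

static void wait_plain(void)
{
	while (!done)			/* compiler may hoist the load and */
		cpu_relax();		/* spin on a stale register value  */
}

static void wait_once(void)
{
	while (!ACCESS_ONCE(done))	/* forces a fresh load each pass */
		cpu_relax();
}

Hardware reordering then sits on top of that, and is what the barriers
are about.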

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Paul E. McKenney
On Mon, Nov 04, 2013 at 07:59:23AM +1100, Benjamin Herrenschmidt wrote:
 On Sun, 2013-11-03 at 16:17 +0100, Peter Zijlstra wrote:
  On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
   If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
  
  Well, I'm obviously all for introducing this new barrier, for it will
  reduce a full mfence on x86 to a compiler barrier. And ppc can use
  lwsync as opposed to sync afaict. Not sure ARM can do better.
 
 The patch at the *very least* needs a good description of the semantics
 of the barrier, what does it order vs. what etc...

Agreed.  Also it needs a name that people can live with.  We will get
there.  ;-)

Thanx, Paul

 Cheers,
 Ben.
 
  ---
  Subject: arch: Introduce new TSO memory barrier smp_tmb()
  
  A few sites could be downgraded from smp_mb() to smp_tmb(), and a few
  sites that are now using smp_wmb() should be upgraded to smp_tmb().
  
  XXX hope PaulMck explains things better..
  
  X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
  reduces to barrier().
  
  PPC can use lwsync instead of sync
  
  For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
  is always correct but possibly suboptimal.
  
  Suggested-by: Paul McKenney paul...@linux.vnet.ibm.com
  Not-Signed-off-by: Peter Zijlstra pet...@infradead.org
  ---
   arch/alpha/include/asm/barrier.h  | 2 ++
   arch/arc/include/asm/barrier.h| 2 ++
   arch/arm/include/asm/barrier.h| 2 ++
   arch/arm64/include/asm/barrier.h  | 2 ++
   arch/avr32/include/asm/barrier.h  | 1 +
   arch/blackfin/include/asm/barrier.h   | 1 +
   arch/cris/include/asm/barrier.h   | 2 ++
   arch/frv/include/asm/barrier.h| 1 +
   arch/h8300/include/asm/barrier.h  | 2 ++
   arch/hexagon/include/asm/barrier.h| 1 +
   arch/ia64/include/asm/barrier.h   | 2 ++
   arch/m32r/include/asm/barrier.h   | 2 ++
   arch/m68k/include/asm/barrier.h   | 1 +
   arch/metag/include/asm/barrier.h  | 3 +++
   arch/microblaze/include/asm/barrier.h | 1 +
   arch/mips/include/asm/barrier.h   | 3 +++
   arch/mn10300/include/asm/barrier.h| 2 ++
   arch/parisc/include/asm/barrier.h | 1 +
   arch/powerpc/include/asm/barrier.h| 2 ++
   arch/s390/include/asm/barrier.h   | 1 +
   arch/score/include/asm/barrier.h  | 1 +
   arch/sh/include/asm/barrier.h | 2 ++
   arch/sparc/include/asm/barrier_32.h   | 1 +
   arch/sparc/include/asm/barrier_64.h   | 3 +++
   arch/tile/include/asm/barrier.h   | 2 ++
   arch/unicore32/include/asm/barrier.h  | 1 +
   arch/x86/include/asm/barrier.h| 3 +++
   arch/xtensa/include/asm/barrier.h | 1 +
   28 files changed, 48 insertions(+)
  
  diff --git a/arch/alpha/include/asm/barrier.h 
  b/arch/alpha/include/asm/barrier.h
  index ce8860a0b32d..02ea63897038 100644
  --- a/arch/alpha/include/asm/barrier.h
  +++ b/arch/alpha/include/asm/barrier.h
   @@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
   #ifdef CONFIG_SMP
    #define __ASM_SMP_MB   "\tmb\n"
   #define smp_mb()   mb()
  +#define smp_tmb()  mb()
   #define smp_rmb()  rmb()
   #define smp_wmb()  wmb()
   #define smp_read_barrier_depends() read_barrier_depends()
   #else
   #define __ASM_SMP_MB
   #define smp_mb()   barrier()
  +#define smp_tmb()  barrier()
   #define smp_rmb()  barrier()
   #define smp_wmb()  barrier()
   #define smp_read_barrier_depends() do { } while (0)
  diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
  index f6cb7c4ffb35..456c790fa1ad 100644
  --- a/arch/arc/include/asm/barrier.h
  +++ b/arch/arc/include/asm/barrier.h
  @@ -22,10 +22,12 @@
   /* TODO-vineetg verify the correctness of macros here */
   #ifdef CONFIG_SMP
   #define smp_mb()mb()
  +#define smp_tmb()  mb()
   #define smp_rmb()   rmb()
   #define smp_wmb()   wmb()
   #else
   #define smp_mb()barrier()
  +#define smp_tmb()  barrier()
   #define smp_rmb()   barrier()
   #define smp_wmb()   barrier()
   #endif
  diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
  index 60f15e274e6d..bc88a8505673 100644
  --- a/arch/arm/include/asm/barrier.h
  +++ b/arch/arm/include/asm/barrier.h
  @@ -51,10 +51,12 @@
   
   #ifndef CONFIG_SMP
   #define smp_mb()   barrier()
  +#define smp_tmb()  barrier()
   #define smp_rmb()  barrier()
   #define smp_wmb()  barrier()
   #else
   #define smp_mb()   dmb(ish)
  +#define smp_tmb()  smp_mb()
   #define smp_rmb()  smp_mb()
   #define smp_wmb()  dmb(ishst)
   #endif
  diff --git a/arch/arm64/include/asm/barrier.h 
  b/arch/arm64/include/asm/barrier.h
  index d4a63338a53c..ec0531f4892f 100644
  --- a/arch/arm64/include/asm/barrier.h
  +++ b/arch/arm64/include/asm/barrier.h
  @@ -33,10 +33,12 @@
   
   #ifndef CONFIG_SMP
   #define smp_mb()   barrier()
  +#define smp_tmb()  barrier()
   #define smp_rmb()  barrier()
   #define smp_wmb()  

Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()

2013-11-03 Thread Linus Torvalds
On Sun, Nov 3, 2013 at 2:42 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 smp_storebuffer_mb() -- A barrier that enforces those orderings
 that do not invalidate the hardware store-buffer optimization.

Ugh. Maybe. Can you guarantee that those are the correct semantics?
And why talk about the hardware semantics, when you really want
specific semantics for the *software*.

 smp_not_w_r_mb() -- A barrier that orders everything except prior
 writes against subsequent reads.

Ok, that sounds more along the lines of "these are the semantics we
want", but I have to say, it also doesn't make me go "ahh, ok".

 smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
 semantics.  (C/C++ acquire orders a specific load against
 subsequent loads and stores, while C/C++ release orders
 a specific store against prior loads and stores.)

I don't think this is true. acquire+release is much stronger than what
you're looking for - it doesn't allow subsequent reads to move past
the write (because that would violate the acquire part). On x86, for
example, you'd need to have a locked cycle for smp_acqrel_mb().

So again, what are the guarantees you actually want? Describe those.
And then make a name.

I _think_ the guarantees you want are:
 - SMP write barrier
 - *local* read barrier for reads preceding the write.

but the problem is that the "preceding reads" part is really
specifically about the write that you had. The barrier should really
be attached to the *particular* write operation, it cannot be a
standalone barrier.

So it would *kind* of act like a smp_wmb() + smp_rmb(), but the
problem is that a smp_rmb() doesn't really attach to the preceding
write.

This is analogous to an "acquire" operation: you cannot make an
"acquire barrier", because it's not a barrier *between* two ops, it's
associated with one particular op.

So what I *think* you actually really really want is a store with
release consistency, followed by a write barrier.

In TSO, afaik all stores have release consistency, and all writes are
ordered, which is why this is a no-op in TSO. And x86 also has that
"all stores have release consistency, and all writes are ordered"
model, even if TSO doesn't really describe the x86 model.

But on ARM64, for example, I think you'd really want the store itself
to be done with "stlr" (store with release), and then follow up with a
"dsb st" after that.

And notice how that requires you to mark the store itself. There is no
actual barrier *after* the store that does the optimized model.

Of course, it's entirely possible that it's not worth worrying about
this on ARM64, and that just doing it as a normal store followed by a
full memory barrier is good enough. But at least in *theory* a
microarchitecture might make it much cheaper to do a store with
release consistency followed by write barrier.

Anyway, having talked exhaustively about exactly what semantics you
are after, I *think* the best model would be to just have a

  #define smp_store_with_release_semantics(x, y) ...

and use that *and* a smp_wmb() for this (possibly a special
smp_wmb_after_release() if that allows people to avoid double
barriers). On x86 (and TSO systems), the
smp_store_with_release_semantics() would be just a regular store, and
the smp_wmb() is obviously a no-op. Other platforms would end up doing
other things.
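
To make that a bit more concrete, a rough sketch of the two obvious
variants (the macro bodies below are illustrative guesses, not proposed
code; arm64 would be the "stlr" + "dsb st" version described above):

/* x86 and other TSO machines: the plain store already has release
 * semantics, so both halves reduce to compiler barriers. */
#define smp_store_with_release_semantics(x, y)	\
do {						\
	barrier();				\
	ACCESS_ONCE(x) = (y);			\
} while (0)
#define smp_wmb_after_release()	barrier()

/* Generic fallback: a full barrier before the store is always a
 * correct (if conservative) release, and the wmb stays a real wmb. */
#define smp_store_with_release_semantics(x, y)	\
do {						\
	smp_mb();				\
	ACCESS_ONCE(x) = (y);			\
} while (0)
#define smp_wmb_after_release()	smp_wmb()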

Hmm?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/