Re: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

2014-11-14 Thread Alexander Duyck


On 11/14/2014 02:45 AM, David Laight wrote:

> From: Alexander Duyck
>> It is common for device drivers to make use of acquire/release semantics
>> when dealing with descriptors stored in device memory.  On reviewing the
>> documentation and code for smp_load_acquire() and smp_store_release() as
>> well as reviewing an IBM website that goes over the use of PowerPC barriers
>> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
>> occurred to me that the same code could likely be applied to device drivers.
>>
>> As a result this patch introduces load_acquire() and store_release().  The
>> load_acquire() function can be used in the place of situations where a test
>> for ownership must be followed by a memory barrier.  The below example is
>> from ixgbe:
>>
>>     if (!rx_desc->wb.upper.status_error)
>>         break;
>>
>>     /* This memory barrier is needed to keep us from reading
>>      * any other fields out of the rx_desc until we know the
>>      * descriptor has been written back
>>      */
>>     rmb();
>>
>> With load_acquire() this can be changed to:
>>
>>     if (!load_acquire(&rx_desc->wb.upper.status_error))
>>         break;
>
> If I'm quickly reading the 'new' code I need to look up yet another
> function, with the 'old' code I can easily see the logic.
>
> You've also added a memory barrier to the 'break' path - which isn't needed.
>
> The driver might also have additional code that can be added before the barrier
> so reducing the cost of the barrier.
>
> The driver may also be able to perform multiple actions before a barrier is
> needed.
>
> Hiding barriers isn't necessarily a good idea anyway.
> If you are writing a driver you need to understand when and where they are
> needed.
>
> Maybe you need a new (weaker) barrier to replace rmb() on some architectures.
>
> ...
>
> David


Yeah, I think I might explore creating some lightweight barriers. The 
load/acquire stuff is a bit overkill for what is needed.


Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

2014-11-14 Thread Alexander Duyck


On 11/14/2014 02:19 AM, Will Deacon wrote:

> Hi Alex,
>
> On Thu, Nov 13, 2014 at 07:27:23PM +, Alexander Duyck wrote:
>> It is common for device drivers to make use of acquire/release semantics
>> when dealing with descriptors stored in device memory.  On reviewing the
>> documentation and code for smp_load_acquire() and smp_store_release() as
>> well as reviewing an IBM website that goes over the use of PowerPC barriers
>> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
>> occurred to me that the same code could likely be applied to device drivers.
>>
>> As a result this patch introduces load_acquire() and store_release().  The
>> load_acquire() function can be used in the place of situations where a test
>> for ownership must be followed by a memory barrier.  The below example is
>> from ixgbe:
>>
>>     if (!rx_desc->wb.upper.status_error)
>>         break;
>>
>>     /* This memory barrier is needed to keep us from reading
>>      * any other fields out of the rx_desc until we know the
>>      * descriptor has been written back
>>      */
>>     rmb();
>>
>> With load_acquire() this can be changed to:
>>
>>     if (!load_acquire(&rx_desc->wb.upper.status_error))
>>         break;
>
> I still don't think this is a good idea for the specific use-case you're
> highlighting.
>
> On ARM, an mb() can be *significantly* more expensive than an rmb() (since
> we may have to drain store buffers on an outer L2 cache) and on arm64 it's
> not at all clear that an LDAR is more efficient than an LDR; DMB LD
> sequence. I can certainly imagine implementations where the latter would
> be preferred.


Yeah, I am pretty sure I overdid it in using a mb() for arm.  I think
what I should probably be using instead is something like dmb(ish),
which is what smp_mb() uses.  The general idea is to enforce ordering
between memory-memory accesses.  The memory-mmio accesses should still
be using a full rmb()/wmb() barrier.


The alternative I am mulling over is creating something like a 
lightweight set of memory barriers named lw_mb(), lw_rmb(), lw_wmb(), 
that could be used instead.  The general idea is that on many 
architectures a full mb/rmb/wmb is far too much for just guaranteeing 
ordering for system memory only writes or reads.  I'm thinking I could 
probably use the smp_ varieties as a template for them since I'm 
thinking that in most cases this should be correct.
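
As a rough illustration of the lightweight-barrier idea, here is a minimal userspace sketch that maps the lw_*() names onto C11 fences. This is only an analogy under stated assumptions: the macros below are hypothetical, the mailbox structure and helper names are made up for the example, and C11 fences order accesses between CPU threads rather than between a CPU and a DMA device. Note also that an acquire or release fence is somewhat stronger than a pure load-load or store-store barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lightweight barriers, expressed as C11 fences.
 * Assumption: userspace analogy only, not the kernel implementation. */
#define lw_mb()   atomic_thread_fence(memory_order_seq_cst)
#define lw_rmb()  atomic_thread_fence(memory_order_acquire)
#define lw_wmb()  atomic_thread_fence(memory_order_release)

/* Made-up structure standing in for a descriptor in system memory. */
struct mailbox {
    int payload;
    atomic_bool ready;
};

/* Producer: write the payload, then publish the flag. */
static void mailbox_post(struct mailbox *m, int value)
{
    assert(m);
    m->payload = value;
    lw_wmb();   /* payload write is ordered before the flag write */
    atomic_store_explicit(&m->ready, true, memory_order_relaxed);
}

/* Consumer: check the flag, then read the payload. */
static bool mailbox_fetch(struct mailbox *m, int *value)
{
    assert(m && value);
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed))
        return false;
    lw_rmb();   /* flag read is ordered before the payload read */
    *value = m->payload;
    return true;
}
```

Calling mailbox_fetch() before mailbox_post() returns false; after the post it returns true with the posted value.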


Also, just to be clear I am not advocating replacing the wmb() in most 
I/O setups where we have to sync the system memory before doing the MMIO 
write.  This is for the case where the device descriptor ring has some 
bit indicating ownership by either the device or the CPU.  So for 
example on the r8169 they have to do a wmb() before writing the DescOwn 
bit in the first descriptor of a given set of Tx descriptors to 
guarantee the rest are written, then they set the DescOwn bit, then they 
call wmb() again to flush that last bit before notifying the device it 
can start fetching the descriptors. My goal is to deal with that first
wmb() and leave the second as it is, since it is correct.



> So, whilst I'm perfectly fine to go along with mandatory acquire/release
> macros (we should probably add a check to barf on __iomem pointers), I
> don't agree with using them in preference to finer-grained read/write
> barriers. Doing so will have a real impact on I/O performance.


Couldn't that type of check be added to compiletime_assert_atomic_type?  
That seems like that would be the best place for something like that.




> Finally, do you know of any architectures where load_acquire/store_release
> aren't implemented the same way as the smp_* variants on SMP kernels?
>
> Will


I should probably go back through and sort out the cases where mb() and
smp_mb() are not the same thing.  I think I went with too harsh a
barrier in a couple of other cases as well.


Thanks,

Alex


RE: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

2014-11-14 Thread David Laight
From: Alexander Duyck
> It is common for device drivers to make use of acquire/release semantics
> when dealing with descriptors stored in device memory.  On reviewing the
> documentation and code for smp_load_acquire() and smp_store_release() as
> well as reviewing an IBM website that goes over the use of PowerPC barriers
> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
> occurred to me that the same code could likely be applied to device drivers.
> 
> As a result this patch introduces load_acquire() and store_release().  The
> load_acquire() function can be used in the place of situations where a test
> for ownership must be followed by a memory barrier.  The below example is
> from ixgbe:
> 
>     if (!rx_desc->wb.upper.status_error)
>         break;
>
>     /* This memory barrier is needed to keep us from reading
>      * any other fields out of the rx_desc until we know the
>      * descriptor has been written back
>      */
>     rmb();
>
> With load_acquire() this can be changed to:
>
>     if (!load_acquire(&rx_desc->wb.upper.status_error))
>         break;

If I'm quickly reading the 'new' code I need to look up yet another
function, with the 'old' code I can easily see the logic.

You've also added a memory barrier to the 'break' path - which isn't needed.

The driver might also have additional code that can be added before the barrier
so reducing the cost of the barrier.

The driver may also be able to perform multiple actions before a barrier is 
needed.

Hiding barriers isn't necessarily a good idea anyway.
If you are writing a driver you need to understand when and where they are 
needed.

Maybe you need a new (weaker) barrier to replace rmb() on some architectures.
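
The point about not paying for a barrier on the 'break' path can be sketched in userspace C11 as a plain (relaxed) load followed by an acquire fence only once the ownership test has passed. This is a hedged illustration, not driver code: the descriptor layout and the function names are hypothetical, and C11 atomics model ordering between CPU threads, not against a device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor: 'status' becomes non-zero once the other
 * fields have been written back. */
struct rx_desc {
    uint32_t length;
    _Atomic uint32_t status;
};

/* Stand-in for the writer side completing a descriptor. */
static void fake_writeback(struct rx_desc *d, uint32_t len)
{
    assert(d);
    d->length = len;
    atomic_store_explicit(&d->status, 1, memory_order_release);
}

/* Barrier is only executed when the descriptor is actually ready. */
static bool rx_poll_fence(struct rx_desc *d, uint32_t *len)
{
    assert(d && len);
    if (!atomic_load_explicit(&d->status, memory_order_relaxed))
        return false;   /* early exit: no barrier needed here */

    /* Orders the status read before the reads that follow. */
    atomic_thread_fence(memory_order_acquire);
    *len = d->length;
    return true;
}
```

Compared with an acquire load on the status word itself, this keeps the cost of the barrier off the path where nothing else is read.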

...


David



Re: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

2014-11-14 Thread Will Deacon
Hi Alex,

On Thu, Nov 13, 2014 at 07:27:23PM +, Alexander Duyck wrote:
> It is common for device drivers to make use of acquire/release semantics
> when dealing with descriptors stored in device memory.  On reviewing the
> documentation and code for smp_load_acquire() and smp_store_release() as
> well as reviewing an IBM website that goes over the use of PowerPC barriers
> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
> occurred to me that the same code could likely be applied to device drivers.
> 
> As a result this patch introduces load_acquire() and store_release().  The
> load_acquire() function can be used in the place of situations where a test
> for ownership must be followed by a memory barrier.  The below example is
> from ixgbe:
> 
>     if (!rx_desc->wb.upper.status_error)
>         break;
>
>     /* This memory barrier is needed to keep us from reading
>      * any other fields out of the rx_desc until we know the
>      * descriptor has been written back
>      */
>     rmb();
>
> With load_acquire() this can be changed to:
>
>     if (!load_acquire(&rx_desc->wb.upper.status_error))
>         break;

I still don't think this is a good idea for the specific use-case you're
highlighting.

On ARM, an mb() can be *significantly* more expensive than an rmb() (since
we may have to drain store buffers on an outer L2 cache) and on arm64 it's
not at all clear that an LDAR is more efficient than an LDR; DMB LD
sequence. I can certainly imagine implementations where the latter would
be preferred.

So, whilst I'm perfectly fine to go along with mandatory acquire/release
macros (we should probably add a check to barf on __iomem pointers), I
don't agree with using them in preference to finer-grained read/write
barriers. Doing so will have a real impact on I/O performance.

Finally, do you know of any architectures where load_acquire/store_release
aren't implemented the same way as the smp_* variants on SMP kernels?

Will


[PATCH 1/3] arch: Introduce load_acquire() and store_release()

2014-11-13 Thread Alexander Duyck
It is common for device drivers to make use of acquire/release semantics
when dealing with descriptors stored in device memory.  On reviewing the
documentation and code for smp_load_acquire() and smp_store_release() as
well as reviewing an IBM website that goes over the use of PowerPC barriers
at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
occurred to me that the same code could likely be applied to device drivers.

As a result this patch introduces load_acquire() and store_release().  The
load_acquire() function can be used in the place of situations where a test
for ownership must be followed by a memory barrier.  The below example is
from ixgbe:

    if (!rx_desc->wb.upper.status_error)
        break;

    /* This memory barrier is needed to keep us from reading
     * any other fields out of the rx_desc until we know the
     * descriptor has been written back
     */
    rmb();

With load_acquire() this can be changed to:

    if (!load_acquire(&rx_desc->wb.upper.status_error))
        break;
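
For readers who want to experiment outside the kernel, the acquire pattern above can be approximated with C11 atomics. This is a minimal sketch under stated assumptions: the descriptor layout and the function names are hypothetical, and atomic_load_explicit() with memory_order_acquire is only a userspace analogue of the proposed load_acquire(), ordering accesses between CPU threads rather than against a DMA device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor: the writer fills 'length' first and sets
 * 'status' last, so a non-zero status means write-back is complete. */
struct rx_desc {
    uint32_t length;
    _Atomic uint32_t status;
};

/* Stand-in for the device completing a descriptor. */
static void fake_device_writeback(struct rx_desc *d, uint32_t len)
{
    assert(d);
    d->length = len;
    atomic_store_explicit(&d->status, 1, memory_order_release);
}

/* Returns true and fills *len once the descriptor has been written back. */
static bool rx_poll(struct rx_desc *d, uint32_t *len)
{
    assert(d && len);
    /* Acquire load: reads of the other descriptor fields cannot be
     * reordered before this read of the status word. */
    if (!atomic_load_explicit(&d->status, memory_order_acquire))
        return false;

    *len = d->length;
    return true;
}
```

rx_poll() returns false on a fresh descriptor and true, with the stored length, after fake_device_writeback() has run.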

A similar change can be made in the release path of many drivers.  For
example in the Realtek r8169 driver there are a number of flows that
consist of something like the following:

    wmb();

    status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
    txd->opts1 = cpu_to_le32(status);

    tp->cur_tx += frags + 1;

    wmb();

With store_release() this can be changed to the following:

    status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
    store_release(&txd->opts1, cpu_to_le32(status));

    tp->cur_tx += frags + 1;

    wmb();
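
The release side can be sketched the same way in userspace C11. Again this is an illustration under assumptions, not the r8169 code: the descriptor layout, the DESC_OWN value, and the helper names are hypothetical, and the final wmb() before the MMIO doorbell write has no counterpart here because there is no device to notify.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_OWN 0x80000000u    /* hypothetical ownership bit */

/* Hypothetical Tx descriptor: 'opts1' carries the ownership bit. */
struct tx_desc {
    uint64_t addr;
    _Atomic uint32_t opts1;
};

/* Fill the descriptor, then hand it over with a release store. */
static void tx_publish(struct tx_desc *d, uint64_t addr, uint32_t len)
{
    assert(d);
    d->addr = addr;     /* ordinary store to the descriptor body */

    /* Release store: the write to 'addr' above is ordered before the
     * ownership bit becomes visible. */
    atomic_store_explicit(&d->opts1, DESC_OWN | len, memory_order_release);
}

/* Consumer-side check, standing in for the device reading the bit. */
static bool tx_owned_by_device(struct tx_desc *d)
{
    assert(d);
    return (atomic_load_explicit(&d->opts1, memory_order_acquire)
            & DESC_OWN) != 0;
}
```

The release store takes the place of the first wmb() plus the plain assignment to opts1 in the flow above.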

The resulting assembler code can be significantly less expensive on
architectures such as x86 and s390 that support strong ordering.  On
architectures such as powerpc, ia64, and arm64, which are able to use
different primitives than their rmb()/wmb(), we should see gains as we are
able to use less expensive barriers.  For other architectures we end up
using a mb(), which may cost as much as or more than a rmb()/wmb(), as we
must ensure Load/Store ordering.

Cc: Benjamin Herrenschmidt 
Cc: Frederic Weisbecker 
Cc: Mathieu Desnoyers 
Cc: Michael Ellerman 
Cc: Michael Neuling 
Cc: Russell King 
Cc: Geert Uytterhoeven 
Cc: Heiko Carstens 
Cc: Linus Torvalds 
Cc: Martin Schwidefsky 
Cc: Tony Luck 
Cc: Oleg Nesterov 
Cc: Will Deacon 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: David Miller 
Signed-off-by: Alexander Duyck 
---
 arch/arm/include/asm/barrier.h  |   15 +
 arch/arm64/include/asm/barrier.h|   59 ++-
 arch/ia64/include/asm/barrier.h |7 +++-
 arch/metag/include/asm/barrier.h|   15 +
 arch/mips/include/asm/barrier.h |   15 +
 arch/powerpc/include/asm/barrier.h  |   24 +++---
 arch/s390/include/asm/barrier.h |7 +++-
 arch/sparc/include/asm/barrier_64.h |6 ++--
 arch/x86/include/asm/barrier.h  |   22 -
 include/asm-generic/barrier.h   |   15 +
 10 files changed, 144 insertions(+), 41 deletions(-)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..bbdcd34 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,21 @@
 #define smp_wmb()  dmb(ishst)
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..c91571c 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,33 +32,7 @@
 #define rmb()  dsb(ld)
 #define wmb()  dsb(st)
 
-#ifndef CONFIG_SMP
-#define smp_mb()   barrier()
-#define smp_rmb()  barrier()
-#define smp_wmb()  barrier()
-
-#define smp_store_release(p, v)						\
-do {   
