Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-10-17 Thread David Marchand
On Wed, Oct 16, 2019 at 11:04 AM Phil Yang (Arm Technology China)
 wrote:
>
> > -Original Message-
> > From: David Marchand 
> > Sent: Tuesday, October 15, 2019 8:16 PM
> > To: Phil Yang (Arm Technology China) 
> > Cc: tho...@monjalon.net; jer...@marvell.com; Gage Eads
> > ; dev ; hemant.agra...@nxp.com;
> > Honnappa Nagarahalli ; Gavin Hu (Arm
> > Technology China) ; nd 
> > Subject: Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic
> > compare exchange
> >
> > On Tue, Oct 15, 2019 at 1:32 PM Phil Yang (Arm Technology China)
> >  wrote:
> > > > -Original Message-
> > > > From: David Marchand 
> > > > If LSE is available, we expose __rte_cas_XX (explicitely) *non*
> > > > inlined functions, while without LSE, we expose inlined __rte_ldr_XX
> > > > and __rte_stx_XX functions.
> > > > So we have a first disparity with non-inlined vs inlined functions
> > > > depending on a #ifdef.
> >
> > You did not comment on the inline / no inline part and I still see
> > this in the v10.
> > Is this __rte_noinline on the CAS function intentional?
>
> Apologize for missing this item. Yes, it is to avoid ABI break.
> Please check
> 5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix possible 
> arm64 ABI break")

Looked at the kernel parts on LSE CAS (thanks for the pointer) but I
see inlines are used:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/atomic_lse.h#n365?h=v5.4-rc3

What is special in the kernel or in dpdk that makes this different?


>
> >
> >
> > > > Then, we have a second disparity with two sets of "apis" depending on
> > > > this #ifdef.
> > > >
> > > > And we expose those sets with a rte_ prefix, meaning people will try
> > > > to use them, but those are not part of a public api.
> > > >
> > > > Can't we do without them ? (see below [2] for a proposal with ldr/stx,
> > > > cas should be the same)
> > >
> > > No, it doesn't work.
> > > Because we need to verify the return value at the end of the loop for 
> > > these
> > macros.
> >
> > Do you mean the return value for the stores?
>
> It is my bad. I missed the ret option in the macro. This approach works.

Ok, thanks for confirming.


>
> However, I suggest to keep them as static inline functions rather than a 
> piece of macro in the rte_atomic128_cmp_exchange API.
> One reason is APIs name can indicate the memory ordering of these operations.

API?
Those inlines are not part of a public API and we agree this patch is
not about adding 128 bits load/store apis.

My proposal gives us small code that looks like:
if (ldx_mo == __ATOMIC_RELAXED)
__READ_128("ldxp", dst, old);
else
__READ_128("ldaxp", dst, old);

I am not a memory order guru, but with this, I can figure the asm
instruction depends on it.
And, since we are looking at internals of an implementation, this is
mainly for people looking at/maintaining these low level details.


> Moreover, it uses the register type to pass the value in the inline function, 
> so it should not have too much cost comparing with the macro.

This is not a problem of cost, this is about hiding architecture
details from the final user.
If you expose something, you can expect someone will start using it
and will complain later if you break it.


> I also think these 128bit load and store functions can be used in other 
> places, once it has been proved valuable in rte_atomic128_cmp_exchange API. 
> But let's keep them private for the current stage.

Yes I agree this could be introduced in the future.


> BTW, Linux kernel implemented in the same way. 
> https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/atomic_lse.h#L19

Ok kernel exposes its internals, but I think kernel developpers are
more vigilant than dpdk developpers on what is part of the public API
and what is internal.


--
David Marchand



Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-10-16 Thread Phil Yang (Arm Technology China)
> -Original Message-
> From: David Marchand 
> Sent: Tuesday, October 15, 2019 8:16 PM
> To: Phil Yang (Arm Technology China) 
> Cc: tho...@monjalon.net; jer...@marvell.com; Gage Eads
> ; dev ; hemant.agra...@nxp.com;
> Honnappa Nagarahalli ; Gavin Hu (Arm
> Technology China) ; nd 
> Subject: Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic
> compare exchange
> 
> On Tue, Oct 15, 2019 at 1:32 PM Phil Yang (Arm Technology China)
>  wrote:
> > > -Original Message-
> > > From: David Marchand 
> > > If LSE is available, we expose __rte_cas_XX (explicitely) *non*
> > > inlined functions, while without LSE, we expose inlined __rte_ldr_XX
> > > and __rte_stx_XX functions.
> > > So we have a first disparity with non-inlined vs inlined functions
> > > depending on a #ifdef.
> 
> You did not comment on the inline / no inline part and I still see
> this in the v10.
> Is this __rte_noinline on the CAS function intentional?

Apologize for missing this item. Yes, it is to avoid ABI break.
Please check
5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix possible 
arm64 ABI break")

> 
> 
> > > Then, we have a second disparity with two sets of "apis" depending on
> > > this #ifdef.
> > >
> > > And we expose those sets with a rte_ prefix, meaning people will try
> > > to use them, but those are not part of a public api.
> > >
> > > Can't we do without them ? (see below [2] for a proposal with ldr/stx,
> > > cas should be the same)
> >
> > No, it doesn't work.
> > Because we need to verify the return value at the end of the loop for these
> macros.
> 
> Do you mean the return value for the stores?

It is my bad. I missed the ret option in the macro. This approach works.

However, I suggest to keep them as static inline functions rather than a piece 
of macro in the rte_atomic128_cmp_exchange API.
One reason is APIs name can indicate the memory ordering of these operations.
Moreover, it uses the register type to pass the value in the inline function, 
so it should not have too much cost comparing with the macro.
I also think these 128bit load and store functions can be used in other places, 
once it has been proved valuable in rte_atomic128_cmp_exchange API. But let's 
keep them private for the current stage.
BTW, Linux kernel implemented in the same way. 
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/atomic_lse.h#L19
 
 
> > > #define __STORE_128(op_string, dst, val, ret) \
> > > asm volatile(\
> > > op_string " %w0, %1, %2, %3" \
> > > : "=&r" (ret)\
> > > : "r" (val.val[0]),  \
> > >   "r" (val.val[1]),  \
> > >   "Q" (dst->val[0])  \
> > > : "memory")
> 
> The ret variable is still passed in this macro and the while loop can
> check it later.
> 
> 
> > > > diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h
> > > b/lib/librte_eal/common/include/generic/rte_atomic.h
> > > > index 24ff7dc..e6ab15a 100644
> > > > --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> > > > +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> > > > @@ -1081,6 +1081,20 @@ static inline void
> > > rte_atomic64_clear(rte_atomic64_t *v)
> > > >
> > > >  /* 128 bit atomic operations 
> > > > -*/
> > > >
> > > > +/**
> > > > + * 128-bit integer structure.
> > > > + */
> > > > +RTE_STD_C11
> > > > +typedef struct {
> > > > +   RTE_STD_C11
> > > > +   union {
> > > > +   uint64_t val[2];
> > > > +#ifdef RTE_ARCH_64
> > > > +   __extension__ __int128 int128;
> > > > +#endif
> > >
> > > You hid this field for x86.
> > > What is the reason?
> > No, we are not hid it for x86. The RTE_ARCH_64 flag covered x86 as well.
> 
> Ah indeed, I read it wrong, ARCH_64 ... AARCH64 ... :-)
> 
> 
> 
> --
> David Marchand



Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-10-15 Thread David Marchand
On Tue, Oct 15, 2019 at 1:32 PM Phil Yang (Arm Technology China)
 wrote:
> > -Original Message-
> > From: David Marchand 
> > If LSE is available, we expose __rte_cas_XX (explicitely) *non*
> > inlined functions, while without LSE, we expose inlined __rte_ldr_XX
> > and __rte_stx_XX functions.
> > So we have a first disparity with non-inlined vs inlined functions
> > depending on a #ifdef.

You did not comment on the inline / no inline part and I still see
this in the v10.
Is this __rte_noinline on the CAS function intentional?


> > Then, we have a second disparity with two sets of "apis" depending on
> > this #ifdef.
> >
> > And we expose those sets with a rte_ prefix, meaning people will try
> > to use them, but those are not part of a public api.
> >
> > Can't we do without them ? (see below [2] for a proposal with ldr/stx,
> > cas should be the same)
>
> No, it doesn't work.
> Because we need to verify the return value at the end of the loop for these 
> macros.

Do you mean the return value for the stores?

> > #define __STORE_128(op_string, dst, val, ret) \
> > asm volatile(\
> > op_string " %w0, %1, %2, %3" \
> > : "=&r" (ret)\
> > : "r" (val.val[0]),  \
> >   "r" (val.val[1]),  \
> >   "Q" (dst->val[0])  \
> > : "memory")

The ret variable is still passed in this macro and the while loop can
check it later.


> > > diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h
> > b/lib/librte_eal/common/include/generic/rte_atomic.h
> > > index 24ff7dc..e6ab15a 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> > > @@ -1081,6 +1081,20 @@ static inline void
> > rte_atomic64_clear(rte_atomic64_t *v)
> > >
> > >  /* 128 bit atomic operations 
> > > -*/
> > >
> > > +/**
> > > + * 128-bit integer structure.
> > > + */
> > > +RTE_STD_C11
> > > +typedef struct {
> > > +   RTE_STD_C11
> > > +   union {
> > > +   uint64_t val[2];
> > > +#ifdef RTE_ARCH_64
> > > +   __extension__ __int128 int128;
> > > +#endif
> >
> > You hid this field for x86.
> > What is the reason?
> No, we are not hid it for x86. The RTE_ARCH_64 flag covered x86 as well.

Ah indeed, I read it wrong, ARCH_64 ... AARCH64 ... :-)



--
David Marchand



Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-10-15 Thread Phil Yang (Arm Technology China)
Hi David,

Thanks for your comments. I have addressed most of them in v10.  Please review 
it.
Some comments inline.
 
> -Original Message-
> From: David Marchand 
> Sent: Monday, October 14, 2019 11:44 PM
> To: Phil Yang (Arm Technology China) 
> Cc: tho...@monjalon.net; jer...@marvell.com; Gage Eads
> ; dev ; hemant.agra...@nxp.com;
> Honnappa Nagarahalli ; Gavin Hu (Arm
> Technology China) ; nd 
> Subject: Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic
> compare exchange
> 
> On Wed, Aug 14, 2019 at 10:29 AM Phil Yang  wrote:
> >
> > Add 128-bit atomic compare exchange on aarch64.
> 
> A bit short, seeing the complexity of the code and the additional
> RTE_ARM_FEATURE_ATOMICS config flag.
Updated in v10. 



> >
> > +/* 128 bit atomic operations 
> > -*/
> > +
> > +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) !=
> __ATOMIC_RELEASE)
> > +#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) ==
> __ATOMIC_ACQ_REL || \
> > + (mo) == __ATOMIC_SEQ_CST)
> > +
> > +#define __MO_LOAD(mo)  (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE :
> __ATOMIC_RELAXED)
> > +#define __MO_STORE(mo) (__HAS_RLS((mo)) ? __ATOMIC_RELEASE :
> __ATOMIC_RELAXED)
> 
> Those 4 first macros only make sense when LSE is not available (see below
> [1]).
> Besides, they are used only once, why not directly use those
> conditions where needed?

Agree. I removed __MO_LOAD and __MO_STORE in v10 and kept the __HAS_ACQ and 
__HAS_REL under the non-LSE condition branch in v10. 
I think they can make the code easy to read.

> 
> 
> > +
> > +#if defined(__ARM_FEATURE_ATOMICS) ||
> defined(RTE_ARM_FEATURE_ATOMICS)
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string) 
> >  \
> > +static __rte_noinline rte_int128_t 
> >  \
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old,   
> >  \
> > +   rte_int128_t updated)   
> > \
> > +{  
> >  \
> > +   /* caspX instructions register pair must start from even-numbered
> > +* register at operand 1.
> > +* So, specify registers for local variables here.
> > +*/ 
> > \
> > +   register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];
> > \
> > +   register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];
> > \
> > +   register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];
> > \
> > +   register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];
> > \
> > +   asm volatile(   
> > \
> > +   op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"   
> > \
> > +   : [old0] "+r" (x0), 
> > \
> > +   [old1] "+r" (x1)
> > \
> > +   : [upd0] "r" (x2),  
> > \
> > +   [upd1] "r" (x3),
> > \
> > +   [dst] "r" (dst) 
> > \
> > +   : "memory");
> > \
> > +   old.val[0] = x0;
> > \
> > +   old.val[1] = x1;
> > \
> > +   return old; 
> > \
> > +}
> > +
> > +__ATOMIC128_CAS_OP(__rte_cas_relaxed, "casp")
> > +__ATOMIC128_CAS_OP(__rte_cas_acquire, "caspa")
> > +__ATOMIC128_CAS_OP(__rte_cas_release, "caspl")
> > +__ATOMIC128_CAS_OP(__rte_cas_acq_rel, "caspal")
> 
> If LSE is available, we expose __rte_cas_XX (explicitely) *non*
> inlined functions, while without LSE, we expose inlined __rte_ldr_XX
> and __rte_stx_XX functions.
> So we have a first disparity with non-inlined vs inlined functions
> depending on a #ifdef.
> Then, we 

Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-10-14 Thread David Marchand
On Wed, Aug 14, 2019 at 10:29 AM Phil Yang  wrote:
>
> Add 128-bit atomic compare exchange on aarch64.

A bit short, seeing the complexity of the code and the additional
RTE_ARM_FEATURE_ATOMICS config flag.


Comments inline.

>
> Suggested-by: Jerin Jacob 
> Signed-off-by: Phil Yang 
> Reviewed-by: Honnappa Nagarahalli 
> Tested-by: Honnappa Nagarahalli 
> Acked-by: Jerin Jacob 
> ---
>
> v9:
> Updated 19.11 release note.
>
> v8:
> Fixed "WARNING:LONG_LINE: line over 80 characters" warnings with latest kernel
> checkpatch.pl
>
> v7:
> 1. Adjust code comment.
>
> v6:
> 1. Put the RTE_ARM_FEATURE_ATOMICS flag into EAL group. (Jerin Jocob)
> 2. Keep rte_stack_lf_stubs.h doing nothing. (Gage Eads)
> 3. Fixed 32 bit build issue.
>
> v5:
> 1. Enable RTE_ARM_FEATURE_ATOMICS on octeontx2 in default. (Jerin Jocob)
> 2. Record the reason of introducing "rte_stack_lf_stubs.h" in git
> commit.
> (Jerin, Jocob)
> 3. Fixed a conditional MACRO error in rte_atomic128_cmp_exchange. (Jerin
> Jocob)
>
> v4:
> 1. Add RTE_ARM_FEATURE_ATOMICS flag to support LSE CASP instructions.
> (Jerin Jocob)
> 2. Fix possible arm64 ABI break by making casp_op_name noinline. (Jerin
> Jocob)
> 3. Add rte_stack_lf_stubs.h to reduce the ifdef clutter. (Gage
> Eads/Jerin Jocob)
>
> v3:
> 1. Avoid duplication code with macro. (Jerin Jocob)
> 2. Make invalid memory order to strongest barrier. (Jerin Jocob)
> 3. Update doc/guides/prog_guide/env_abstraction_layer.rst. (Gage Eads)
> 4. Fix 32-bit x86 builds issue. (Gage Eads)
> 5. Correct documentation issues in UT. (Gage Eads)
>
> v2:
> Initial version.
>
>  config/arm/meson.build |   2 +
>  config/common_base |   3 +
>  config/defconfig_arm64-octeontx2-linuxapp-gcc  |   1 +
>  config/defconfig_arm64-thunderx2-linuxapp-gcc  |   1 +
>  .../common/include/arch/arm/rte_atomic_64.h| 163 
> +
>  .../common/include/arch/x86/rte_atomic_64.h|  12 --
>  lib/librte_eal/common/include/generic/rte_atomic.h |  17 ++-
>  7 files changed, 186 insertions(+), 13 deletions(-)
>
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..9f28271 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -71,11 +71,13 @@ flags_thunderx2_extra = [
> ['RTE_CACHE_LINE_SIZE', 64],
> ['RTE_MAX_NUMA_NODES', 2],
> ['RTE_MAX_LCORE', 256],
> +   ['RTE_ARM_FEATURE_ATOMICS', true],
> ['RTE_USE_C11_MEM_MODEL', true]]
>  flags_octeontx2_extra = [
> ['RTE_MACHINE', '"octeontx2"'],
> ['RTE_MAX_NUMA_NODES', 1],
> ['RTE_MAX_LCORE', 24],
> +   ['RTE_ARM_FEATURE_ATOMICS', true],
> ['RTE_EAL_IGB_UIO', false],
> ['RTE_USE_C11_MEM_MODEL', true]]
>
> diff --git a/config/common_base b/config/common_base
> index 8ef75c2..2054480 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -82,6 +82,9 @@ CONFIG_RTE_MAX_LCORE=128
>  CONFIG_RTE_MAX_NUMA_NODES=8
>  CONFIG_RTE_MAX_HEAPS=32
>  CONFIG_RTE_MAX_MEMSEG_LISTS=64
> +
> +# Use ARM LSE ATOMIC instructions
> +CONFIG_RTE_ARM_FEATURE_ATOMICS=n
>  # each memseg list will be limited to either RTE_MAX_MEMSEG_PER_LIST pages
>  # or RTE_MAX_MEM_MB_PER_LIST megabytes worth of memory, whichever is smaller
>  CONFIG_RTE_MAX_MEMSEG_PER_LIST=8192
> diff --git a/config/defconfig_arm64-octeontx2-linuxapp-gcc 
> b/config/defconfig_arm64-octeontx2-linuxapp-gcc
> index f20da24..7687dbe 100644
> --- a/config/defconfig_arm64-octeontx2-linuxapp-gcc
> +++ b/config/defconfig_arm64-octeontx2-linuxapp-gcc
> @@ -9,6 +9,7 @@ CONFIG_RTE_MACHINE="octeontx2"
>  CONFIG_RTE_CACHE_LINE_SIZE=128
>  CONFIG_RTE_MAX_NUMA_NODES=1
>  CONFIG_RTE_MAX_LCORE=24
> +CONFIG_RTE_ARM_FEATURE_ATOMICS=y
>
>  # Doesn't support NUMA
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> diff --git a/config/defconfig_arm64-thunderx2-linuxapp-gcc 
> b/config/defconfig_arm64-thunderx2-linuxapp-gcc
> index cc5c64b..af4a89c 100644
> --- a/config/defconfig_arm64-thunderx2-linuxapp-gcc
> +++ b/config/defconfig_arm64-thunderx2-linuxapp-gcc
> @@ -9,3 +9,4 @@ CONFIG_RTE_MACHINE="thunderx2"
>  CONFIG_RTE_CACHE_LINE_SIZE=64
>  CONFIG_RTE_MAX_NUMA_NODES=2
>  CONFIG_RTE_MAX_LCORE=256
> +CONFIG_RTE_ARM_FEATURE_ATOMICS=y
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
> b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 97060e4..14d869b 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2015 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_ATOMIC_ARM64_H_
> @@ -14,6 +15,9 @@ extern "C" {
>  #endif
>
>  #include "generic/rte_atomic.h"
> +#include 
> +#include 
> +#include 
>
>  #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
>  #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
> @@ -40,6 +44,165 @@ extern "C" {

Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-08-14 Thread Jerin Jacob Kollanukkaran
> -Original Message-
> From: Phil Yang (Arm Technology China) 
> Sent: Wednesday, August 14, 2019 3:55 PM
> To: Jerin Jacob Kollanukkaran ; tho...@monjalon.net;
> gage.e...@intel.com; dev@dpdk.org
> Cc: hemant.agra...@nxp.com; Honnappa Nagarahalli
> ; Gavin Hu (Arm Technology China)
> ; nd ; nd 
> Subject: [EXT] RE: [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> External Email
> 
> --
> > -Original Message-
> > From: Jerin Jacob Kollanukkaran 
> > Sent: Wednesday, August 14, 2019 4:46 PM
> > To: Phil Yang (Arm Technology China) ;
> > tho...@monjalon.net; gage.e...@intel.com; dev@dpdk.org
> > Cc: hemant.agra...@nxp.com; Honnappa Nagarahalli
> > ; Gavin Hu (Arm Technology China)
> > ; nd 
> > Subject: RE: [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> > exchange
> >
> > > -Original Message-
> > > From: Phil Yang 
> > > Sent: Wednesday, August 14, 2019 1:58 PM
> > > To: tho...@monjalon.net; Jerin Jacob Kollanukkaran
> > ;
> > > gage.e...@intel.com; dev@dpdk.org
> > > Cc: hemant.agra...@nxp.com; honnappa.nagaraha...@arm.com;
> > > gavin...@arm.com; n...@arm.com
> > > Subject: [EXT] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> > > exchange
> > > +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) !=
> > > +__ATOMIC_RELEASE) #define __HAS_RLS(mo) ((mo) ==
> > > __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
> > > +   (mo) == __ATOMIC_SEQ_CST)
> > > +
> > > +#define __MO_LOAD(mo)  (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE :
> > > +__ATOMIC_RELAXED) #define __MO_STORE(mo) (__HAS_RLS((mo)) ?
> > > +__ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > > +
> > > +#if defined(__ARM_FEATURE_ATOMICS) ||
> > > defined(RTE_ARM_FEATURE_ATOMICS)
> > > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)
> \
> > > +static __rte_noinline rte_int128_t   
> > >\
> >
> >
> > Could you check the cost of making it as __rte_noinline?
> > If it is costly, How about having two versions, one with
> > __rte_noinline to make compliance with arm64 procedure call standard
> > for old gcc and clang.
> > Other one without explicit register hardcoding + inline for latest gcc
> 
> Hi Jerin,

Hi Phil Yang,

> According to the stack_lf_perf_autotest, making it as __rte_noinline has no
> overhead on ThunderX2 with GCC 8.3.
> The 'Average cycles per object push/pop' numbers for __rte_noinline and
> __rte_always_inline versions are nearly the same.

I tested with octeontx2 as well. It is yielding similar result. 
No change is expected in this patch then.



Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-08-14 Thread Phil Yang (Arm Technology China)
> -Original Message-
> From: Jerin Jacob Kollanukkaran 
> Sent: Wednesday, August 14, 2019 4:46 PM
> To: Phil Yang (Arm Technology China) ;
> tho...@monjalon.net; gage.e...@intel.com; dev@dpdk.org
> Cc: hemant.agra...@nxp.com; Honnappa Nagarahalli
> ; Gavin Hu (Arm Technology China)
> ; nd 
> Subject: RE: [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange
> 
> > -Original Message-
> > From: Phil Yang 
> > Sent: Wednesday, August 14, 2019 1:58 PM
> > To: tho...@monjalon.net; Jerin Jacob Kollanukkaran
> ;
> > gage.e...@intel.com; dev@dpdk.org
> > Cc: hemant.agra...@nxp.com; honnappa.nagaraha...@arm.com;
> > gavin...@arm.com; n...@arm.com
> > Subject: [EXT] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> > exchange
> > +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) !=
> > +__ATOMIC_RELEASE) #define __HAS_RLS(mo) ((mo) ==
> > __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
> > + (mo) == __ATOMIC_SEQ_CST)
> > +
> > +#define __MO_LOAD(mo)  (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE :
> > +__ATOMIC_RELAXED) #define __MO_STORE(mo) (__HAS_RLS((mo)) ?
> > +__ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > +
> > +#if defined(__ARM_FEATURE_ATOMICS) ||
> > defined(RTE_ARM_FEATURE_ATOMICS)
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string) 
> >  \
> > +static __rte_noinline rte_int128_t 
> >  \
> 
> 
> Could you check the cost of making it as __rte_noinline?
> If it is costly, How about having two versions, one with __rte_noinline
> to make compliance with arm64 procedure call standard for
> old gcc and clang.
> Other one without explicit register hardcoding + inline for latest
> gcc

Hi Jerin,

According to the stack_lf_perf_autotest, making it as __rte_noinline has no 
overhead on ThunderX2 with GCC 8.3.
The 'Average cycles per object push/pop' numbers for __rte_noinline and 
__rte_always_inline versions are nearly the same.

Test results :
## Two NUMA Node ##
 __rte_noinline 

RTE>>stack_lf_perf_autotest

### Testing using two NUMA nodes ###
Average cycles per object push/pop (bulk size: 8): 24.10
Average cycles per object push/pop (bulk size: 32): 6.85

### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 680.39
Average cycles per object push/pop (bulk size: 32): 146.38
Test OK

 __rte_always-inline 
RTE>>stack_lf_perf_autotest

### Testing using two NUMA nodes ###
Average cycles per object push/pop (bulk size: 8): 24.29
Average cycles per object push/pop (bulk size: 32): 6.92

### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 683.92
Average cycles per object push/pop (bulk size: 32): 145.11
Test OK

## Single NUMA ##
 __rte_always-inline 

RTE>>stack_lf_perf_autotest

### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 582.92
Average cycles per object push/pop (bulk size: 32): 125.57
Test OK
 __rte_noinline 

RTE>>stack_lf_perf_autotest

### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 537.56
Average cycles per object push/pop (bulk size: 32): 122.98
Test OK

Thanks,
Phil Yang

> 
> 
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old,   
> >  \
> > +   rte_int128_t updated)   \
> > +{  
> >  \
> > +   /* caspX instructions register pair must start from even-numbered
> > +* register at operand 1.
> > +* So, specify registers for local variables here.
> > +*/ \
> > +   register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];\
> > +   register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];\
> > +   register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];\
> > +   register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];\
> > +   asm volatile(   \
> > +   op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"   \
> > +   : [old0] "+r" (x0), \
> > +   [old1] "+r" (x1)\
> > +   : [upd0] "r" (x2),  \
> > +   [upd1] "r" (x3),\
> > +   [dst] "r" (dst) \
> > +   : "memory");\
> > +   old.val[0] = x0;\
> > +   old.val[1] = x1;\
> > +   return old; \
> > +}
> > +


Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-08-14 Thread Jerin Jacob Kollanukkaran
> -Original Message-
> From: Phil Yang 
> Sent: Wednesday, August 14, 2019 1:58 PM
> To: tho...@monjalon.net; Jerin Jacob Kollanukkaran ;
> gage.e...@intel.com; dev@dpdk.org
> Cc: hemant.agra...@nxp.com; honnappa.nagaraha...@arm.com;
> gavin...@arm.com; n...@arm.com
> Subject: [EXT] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) !=
> +__ATOMIC_RELEASE) #define __HAS_RLS(mo) ((mo) ==
> __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
> +   (mo) == __ATOMIC_SEQ_CST)
> +
> +#define __MO_LOAD(mo)  (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE :
> +__ATOMIC_RELAXED) #define __MO_STORE(mo) (__HAS_RLS((mo)) ?
> +__ATOMIC_RELEASE : __ATOMIC_RELAXED)
> +
> +#if defined(__ARM_FEATURE_ATOMICS) ||
> defined(RTE_ARM_FEATURE_ATOMICS)
> +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)  \
> +static __rte_noinline rte_int128_t  \


Could you check the cost of making it as __rte_noinline?
If it is costly, How about having two versions, one with __rte_noinline
to make compliance with arm64 procedure call standard for
old gcc and clang.
Other one without explicit register hardcoding + inline for latest
gcc


> +cas_op_name(rte_int128_t *dst, rte_int128_t old,\
> + rte_int128_t updated)   \
> +{   \
> + /* caspX instructions register pair must start from even-numbered
> +  * register at operand 1.
> +  * So, specify registers for local variables here.
> +  */ \
> + register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];\
> + register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];\
> + register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];\
> + register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];\
> + asm volatile(   \
> + op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"   \
> + : [old0] "+r" (x0), \
> + [old1] "+r" (x1)\
> + : [upd0] "r" (x2),  \
> + [upd1] "r" (x3),\
> + [dst] "r" (dst) \
> + : "memory");\
> + old.val[0] = x0;\
> + old.val[1] = x1;\
> + return old; \
> +}
> +


[dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

2019-08-14 Thread Phil Yang
Add 128-bit atomic compare exchange on aarch64.

Suggested-by: Jerin Jacob 
Signed-off-by: Phil Yang 
Reviewed-by: Honnappa Nagarahalli 
Tested-by: Honnappa Nagarahalli 
Acked-by: Jerin Jacob 
---

v9:
Updated 19.11 release note.

v8:
Fixed "WARNING:LONG_LINE: line over 80 characters" warnings with latest kernel
checkpatch.pl

v7:
1. Adjust code comment.

v6:
1. Put the RTE_ARM_FEATURE_ATOMICS flag into EAL group. (Jerin Jocob)
2. Keep rte_stack_lf_stubs.h doing nothing. (Gage Eads)
3. Fixed 32 bit build issue.

v5:
1. Enable RTE_ARM_FEATURE_ATOMICS on octeontx2 in default. (Jerin Jocob)
2. Record the reason of introducing "rte_stack_lf_stubs.h" in git
commit.
(Jerin, Jocob)
3. Fixed a conditional MACRO error in rte_atomic128_cmp_exchange. (Jerin
Jocob)

v4:
1. Add RTE_ARM_FEATURE_ATOMICS flag to support LSE CASP instructions.
(Jerin Jocob)
2. Fix possible arm64 ABI break by making casp_op_name noinline. (Jerin
Jocob)
3. Add rte_stack_lf_stubs.h to reduce the ifdef clutter. (Gage
Eads/Jerin Jocob)

v3:
1. Avoid duplication code with macro. (Jerin Jocob)
2. Make invalid memory order to strongest barrier. (Jerin Jocob)
3. Update doc/guides/prog_guide/env_abstraction_layer.rst. (Gage Eads)
4. Fix 32-bit x86 builds issue. (Gage Eads)
5. Correct documentation issues in UT. (Gage Eads)

v2:
Initial version.

 config/arm/meson.build |   2 +
 config/common_base |   3 +
 config/defconfig_arm64-octeontx2-linuxapp-gcc  |   1 +
 config/defconfig_arm64-thunderx2-linuxapp-gcc  |   1 +
 .../common/include/arch/arm/rte_atomic_64.h| 163 +
 .../common/include/arch/x86/rte_atomic_64.h|  12 --
 lib/librte_eal/common/include/generic/rte_atomic.h |  17 ++-
 7 files changed, 186 insertions(+), 13 deletions(-)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..9f28271 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -71,11 +71,13 @@ flags_thunderx2_extra = [
['RTE_CACHE_LINE_SIZE', 64],
['RTE_MAX_NUMA_NODES', 2],
['RTE_MAX_LCORE', 256],
+   ['RTE_ARM_FEATURE_ATOMICS', true],
['RTE_USE_C11_MEM_MODEL', true]]
 flags_octeontx2_extra = [
['RTE_MACHINE', '"octeontx2"'],
['RTE_MAX_NUMA_NODES', 1],
['RTE_MAX_LCORE', 24],
+   ['RTE_ARM_FEATURE_ATOMICS', true],
['RTE_EAL_IGB_UIO', false],
['RTE_USE_C11_MEM_MODEL', true]]
 
diff --git a/config/common_base b/config/common_base
index 8ef75c2..2054480 100644
--- a/config/common_base
+++ b/config/common_base
@@ -82,6 +82,9 @@ CONFIG_RTE_MAX_LCORE=128
 CONFIG_RTE_MAX_NUMA_NODES=8
 CONFIG_RTE_MAX_HEAPS=32
 CONFIG_RTE_MAX_MEMSEG_LISTS=64
+
+# Use ARM LSE ATOMIC instructions
+CONFIG_RTE_ARM_FEATURE_ATOMICS=n
 # each memseg list will be limited to either RTE_MAX_MEMSEG_PER_LIST pages
 # or RTE_MAX_MEM_MB_PER_LIST megabytes worth of memory, whichever is smaller
 CONFIG_RTE_MAX_MEMSEG_PER_LIST=8192
diff --git a/config/defconfig_arm64-octeontx2-linuxapp-gcc 
b/config/defconfig_arm64-octeontx2-linuxapp-gcc
index f20da24..7687dbe 100644
--- a/config/defconfig_arm64-octeontx2-linuxapp-gcc
+++ b/config/defconfig_arm64-octeontx2-linuxapp-gcc
@@ -9,6 +9,7 @@ CONFIG_RTE_MACHINE="octeontx2"
 CONFIG_RTE_CACHE_LINE_SIZE=128
 CONFIG_RTE_MAX_NUMA_NODES=1
 CONFIG_RTE_MAX_LCORE=24
+CONFIG_RTE_ARM_FEATURE_ATOMICS=y
 
 # Doesn't support NUMA
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
diff --git a/config/defconfig_arm64-thunderx2-linuxapp-gcc 
b/config/defconfig_arm64-thunderx2-linuxapp-gcc
index cc5c64b..af4a89c 100644
--- a/config/defconfig_arm64-thunderx2-linuxapp-gcc
+++ b/config/defconfig_arm64-thunderx2-linuxapp-gcc
@@ -9,3 +9,4 @@ CONFIG_RTE_MACHINE="thunderx2"
 CONFIG_RTE_CACHE_LINE_SIZE=64
 CONFIG_RTE_MAX_NUMA_NODES=2
 CONFIG_RTE_MAX_LCORE=256
+CONFIG_RTE_ARM_FEATURE_ATOMICS=y
diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..14d869b 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -14,6 +15,9 @@ extern "C" {
 #endif
 
 #include "generic/rte_atomic.h"
+#include 
+#include 
+#include 
 
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
@@ -40,6 +44,165 @@ extern "C" {
 
 #define rte_cio_rmb() dmb(oshld)
 
+/* 128 bit atomic operations 
-*/
+
+#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
+#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || 
\
+ (mo) == __ATOMIC_SEQ_CST)
+
+#define __MO_LOAD(mo)  (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED