Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-26 Thread Mattias Rönnblom

On 2024-05-22 21:51, Stephen Hemminger wrote:

On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff  wrote:


On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:

From: Stephen Hemminger [mailto:step...@networkplumber.org]
Sent: Wednesday, 22 May 2024 17.38

On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup  wrote:
   

+/* On 32 bit platform, need to use atomic to avoid load/store

tearing */

+typedef RTE_ATOMIC(uint64_t) rte_counter64_t;


As shown by Godbolt experiments discussed in a previous thread [2],

non-tearing 64 bit counters can be implemented without using atomic
instructions on all 32 bit architectures supported by DPDK. So we should
use the counter/offset design pattern for RTE_ARCH_32 too.


[2]:

https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/


This code built with -O3 and -m32 on godbolt shows split problem.

#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
*counter += val;
}
…   *counter = val;
}

rte_counter64_add:
 push ebx
 mov eax, DWORD PTR [esp+8]
 xor ebx, ebx
 mov ecx, DWORD PTR [esp+12]
 add DWORD PTR [eax], ecx
 adc DWORD PTR [eax+4], ebx
 pop ebx
 ret

rte_counter64_read:
 mov eax, DWORD PTR [esp+4]
 mov edx, DWORD PTR [eax+4]
 mov eax, DWORD PTR [eax]
 ret
rte_counter64_set:
 movq xmm0, QWORD PTR [esp+8]
 mov eax, DWORD PTR [esp+4]
 movq QWORD PTR [eax], xmm0
 ret


Sure, atomic might be required on some 32 bit architectures and/or with some 
compilers.


in theory i think you should be able to use generic atomics and
depending on the target you get codegen that works. it might be
something more expensive on 32-bit and nothing on 64-bit etc..

what's the damage if we just use atomic generic and relaxed ordering? is
the codegen not optimal?


If we use atomic with relaxed memory order, the compiler for x86 still generates
a locked increment in the fast path. This costs about 100 extra cycles due
to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.



It's because the code is overly restrictive in that case (e.g., needlessly 
forcing the whole read-modify-write to be atomic), and no fault of 
the compiler.


void add(uint64_t *addr, uint64_t operand)
{
uint64_t value = __atomic_load_n(addr, __ATOMIC_RELAXED);
value += operand;
__atomic_store_n(addr, value, __ATOMIC_RELAXED);
}

->

x86_64

add:
mov rax, QWORD PTR [rdi]
add rax, rsi
mov QWORD PTR [rdi], rax
ret


x86

add:
sub esp, 12
mov ecx, DWORD PTR [esp+16]
movq xmm0, QWORD PTR [ecx]
movq QWORD PTR [esp], xmm0
mov eax, DWORD PTR [esp]
mov edx, DWORD PTR [esp+4]
add eax, DWORD PTR [esp+20]
adc edx, DWORD PTR [esp+24]
mov DWORD PTR [esp], eax
mov DWORD PTR [esp+4], edx
movq xmm1, QWORD PTR [esp]
movq QWORD PTR [ecx], xmm1
add esp, 12
ret

No locked instructions.


PS: looking at the locked increment code for 32 bit, it involves a locked compare
exchange and a potential retry. We probably don't care about performance on that
platform anymore.




Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-26 Thread Mattias Rönnblom

On 2024-05-22 21:01, Tyler Retzlaff wrote:

On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:

From: Stephen Hemminger [mailto:step...@networkplumber.org]
Sent: Wednesday, 22 May 2024 17.38

On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup  wrote:


+/* On 32 bit platform, need to use atomic to avoid load/store

tearing */

+typedef RTE_ATOMIC(uint64_t) rte_counter64_t;


As shown by Godbolt experiments discussed in a previous thread [2],

non-tearing 64 bit counters can be implemented without using atomic
instructions on all 32 bit architectures supported by DPDK. So we should
use the counter/offset design pattern for RTE_ARCH_32 too.


[2]:

https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/


This code built with -O3 and -m32 on godbolt shows split problem.

#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
*counter += val;
}
…   *counter = val;
}

rte_counter64_add:
 push ebx
 mov eax, DWORD PTR [esp+8]
 xor ebx, ebx
 mov ecx, DWORD PTR [esp+12]
 add DWORD PTR [eax], ecx
 adc DWORD PTR [eax+4], ebx
 pop ebx
 ret

rte_counter64_read:
 mov eax, DWORD PTR [esp+4]
 mov edx, DWORD PTR [eax+4]
 mov eax, DWORD PTR [eax]
 ret
rte_counter64_set:
 movq xmm0, QWORD PTR [esp+8]
 mov eax, DWORD PTR [esp+4]
 movq QWORD PTR [eax], xmm0
 ret


Sure, atomic might be required on some 32 bit architectures and/or with some 
compilers.


in theory i think you should be able to use generic atomics and
depending on the target you get codegen that works. it might be
something more expensive on 32-bit and nothing on 64-bit etc..

what's the damage if we just use atomic generic and relaxed ordering? is
the codegen not optimal?
  


Below is what I originally proposed in the "make stats reset reliable" 
thread.


struct counter
{
uint64_t count;
uint64_t offset;
};

/../
struct counter rx_pkts;
struct counter rx_bytes;
/../

static uint64_t
counter_value(const struct counter *counter)
{
uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);

return count - offset;
}

static void
counter_reset(struct counter *counter)
{
uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);

__atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
__atomic_store_n(&counter->count, counter->count + operand, 
__ATOMIC_RELAXED);

}

I think this solution generally compiles to something that's equivalent 
to just using non-atomic loads/stores and hoping for the best.
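As a quick sanity check of the arithmetic, here is a self-contained, single-threaded sketch of the counter/offset pattern quoted above (GCC/Clang builtins; with a single thread the relaxed atomics are trivially safe):

```c
#include <stdint.h>

/* Counter/offset pattern: the writer only ever touches "count";
 * a reset is performed by the reader recording the current count
 * as an offset, so no increments can be lost during a reset. */
struct counter {
    uint64_t count;
    uint64_t offset;
};

static uint64_t
counter_value(const struct counter *counter)
{
    uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
    uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);

    return count - offset;
}

static void
counter_reset(struct counter *counter)
{
    uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);

    __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
    /* Non-atomic read of "count" is fine: only one thread writes it. */
    __atomic_store_n(&counter->count, counter->count + operand,
                     __ATOMIC_RELAXED);
}
```

After a reset, the monotonically increasing count keeps its value; only the subtracted offset changes, which is what makes the reset safe against a concurrent writer.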


Using a non-atomic load in counter_add() will generate better code, but 
doesn't work if you're using _Atomic (without casts).


Atomic loads/stores seem to have volatile semantics, so multiple updates 
to the same counter cannot be merged. That is a drawback.
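The (non-)merging behavior can be illustrated with GCC/Clang builtins. Whether the plain version is actually coalesced into a single add depends on the compiler and optimization level, so treat this as a sketch:

```c
#include <stdint.h>

/* Plain stores: the compiler is free to merge the two updates into a
 * single add of (a + b), e.g. when counter updates are inlined into
 * an unrolled receive loop. */
static void
add_twice_plain(uint64_t *counter, uint64_t a, uint64_t b)
{
    *counter += a;
    *counter += b;
}

/* Relaxed atomic stores: in practice compilers treat each store like
 * a volatile access and emit both, so the two updates are not merged. */
static void
add_twice_relaxed(uint64_t *counter, uint64_t a, uint64_t b)
{
    __atomic_store_n(counter,
        __atomic_load_n(counter, __ATOMIC_RELAXED) + a, __ATOMIC_RELAXED);
    __atomic_store_n(counter,
        __atomic_load_n(counter, __ATOMIC_RELAXED) + b, __ATOMIC_RELAXED);
}
```

Both functions compute the same final value; only the generated instruction sequences differ.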



I envision a variety of 32 bit implementations, optimized for certain 
architectures/compilers.

Some of them can provide non-tearing 64 bit load/store, so we should also use 
the counter/offset design pattern for those.



RE: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Wednesday, 22 May 2024 21.54
> 
> On Wed, 22 May 2024 20:09:23 +0200
> Morten Brørup  wrote:
> 
> > > There are no size_bins in the current version of the patch.
> > > And the number of counters in ethdev part are small so it is less of
> a
> > > concern.
> > > The code is easier to maintain if the counter object is self
> contained.
> >
> > I agree that there are advantages to keeping the counter object self
> contained.
> >
> > However, these counters are generic, so we cannot assume that there
> are only very few, based on how the current software device drivers use
> them.
> >
> > Someone might want to add size_bins to the software device drivers.
> > And someone else might want to collect many counters in some
> application or library structure.
> 
> 
> No.
> The implementation should be as simple and as small as possible for the
> use case
> that is presented in the patch series.

I checked a random one of the use cases presented, rte_eth_af_packet:

The pkt_rx_queue struct grows to two cache lines, with the packets counter in 
the first cache line and the other counters in the second cache line.
By moving pkt_rx_queue's "sockfd" field below the pointers, the structure is 
better packed. If following my proposal, i.e. keeping the counters grouped 
together (and the offsets grouped together), all three counters stay within the 
first cache line (and the offsets go into the second).

The pkt_tx_queue struct also grows to two cache lines, with the first two 
counters in the first cache line and the third counter in the second cache 
line. With my proposal, the counters fit within the first cache line.

> Doing something more complex leads to the
> classic YAGNI situation, where when the new case really happens the
> implementation just doesn't quite fit.

I disagree about YAGNI for this patch.
We *do* need the counter API to have the offset separate from the counter to 
avoid a performance degradation.
It is only slightly more complex, so I'm not convinced it's going to be a 
problem.
Passing a pointer to the offset as an additional parameter to fetch() and 
reset() is straightforward.
And using two instances of struct rte_eth_counters, one for counters and one 
for offsets, isn't complex either.

The rte_eth_af_packet use case shows that the risk of touching an increased 
number of cache lines (by moving the offsets into the hot part of the 
structures) is real.



Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Stephen Hemminger
On Wed, 22 May 2024 20:09:23 +0200
Morten Brørup  wrote:

> > There are no size_bins in the current version of the patch.
> > And the number of counters in ethdev part are small so it is less of a
> > concern.
> > The code is easier to maintain if the counter object is self contained.  
> 
> I agree that there are advantages to keeping the counter object self 
> contained.
> 
> However, these counters are generic, so we cannot assume that there are only 
> very few, based on how the current software device drivers use them.
> 
> Someone might want to add size_bins to the software device drivers.
> And someone else might want to collect many counters in some application or 
> library structure.


No.
The implementation should be as simple and as small as possible for the use case
that is presented in the patch series. Doing something more complex leads to the
classic YAGNI situation, where when the new case really happens the
implementation just doesn't quite fit.


Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Stephen Hemminger
On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff  wrote:

> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > Sent: Wednesday, 22 May 2024 17.38
> > > 
> > > On Wed, 22 May 2024 10:31:39 +0200
> > > Morten Brørup  wrote:
> > >   
> > > > > +/* On 32 bit platform, need to use atomic to avoid load/store  
> > > tearing */  
> > > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> > > >
> > > > As shown by Godbolt experiments discussed in a previous thread [2],  
> > > non-tearing 64 bit counters can be implemented without using atomic
> > > instructions on all 32 bit architectures supported by DPDK. So we should
> > > use the counter/offset design pattern for RTE_ARCH_32 too.  
> > > >
> > > > [2]:  
> > > https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> > > 
> > > 
> > > This code built with -O3 and -m32 on godbolt shows split problem.
> > > 
> > > #include <stdint.h>
> > > 
> > > typedef uint64_t rte_counter64_t;
> > > 
> > > void
> > > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > > {
> > >   *counter += val;
> > > }
> > > … *counter = val;
> > > }
> > > 
> > > rte_counter64_add:
> > > push ebx
> > > mov eax, DWORD PTR [esp+8]
> > > xor ebx, ebx
> > > mov ecx, DWORD PTR [esp+12]
> > > add DWORD PTR [eax], ecx
> > > adc DWORD PTR [eax+4], ebx
> > > pop ebx
> > > ret
> > > 
> > > rte_counter64_read:
> > > mov eax, DWORD PTR [esp+4]
> > > mov edx, DWORD PTR [eax+4]
> > > mov eax, DWORD PTR [eax]
> > > ret
> > > rte_counter64_set:
> > > movq xmm0, QWORD PTR [esp+8]
> > > mov eax, DWORD PTR [esp+4]
> > > movq QWORD PTR [eax], xmm0
> > > ret  
> > 
> > Sure, atomic might be required on some 32 bit architectures and/or with 
> > some compilers.  
> 
> in theory i think you should be able to use generic atomics and
> depending on the target you get codegen that works. it might be
> something more expensive on 32-bit and nothing on 64-bit etc..
> 
> what's the damage if we just use atomic generic and relaxed ordering? is
> the codegen not optimal?

If we use atomic with relaxed memory order, the compiler for x86 still generates
a locked increment in the fast path. This costs about 100 extra cycles due
to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.

PS: looking at the locked increment code for 32 bit, it involves a locked compare
exchange and a potential retry. We probably don't care about performance on that
platform anymore.




Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Tyler Retzlaff
On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > Sent: Wednesday, 22 May 2024 17.38
> > 
> > On Wed, 22 May 2024 10:31:39 +0200
> > Morten Brørup  wrote:
> > 
> > > > +/* On 32 bit platform, need to use atomic to avoid load/store
> > tearing */
> > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> > >
> > > As shown by Godbolt experiments discussed in a previous thread [2],
> > non-tearing 64 bit counters can be implemented without using atomic
> > instructions on all 32 bit architectures supported by DPDK. So we should
> > use the counter/offset design pattern for RTE_ARCH_32 too.
> > >
> > > [2]:
> > https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> > 
> > 
> > This code built with -O3 and -m32 on godbolt shows split problem.
> > 
> > #include <stdint.h>
> > 
> > typedef uint64_t rte_counter64_t;
> > 
> > void
> > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > {
> > *counter += val;
> > }
> > …   *counter = val;
> > }
> > 
> > rte_counter64_add:
> > push ebx
> > mov eax, DWORD PTR [esp+8]
> > xor ebx, ebx
> > mov ecx, DWORD PTR [esp+12]
> > add DWORD PTR [eax], ecx
> > adc DWORD PTR [eax+4], ebx
> > pop ebx
> > ret
> > 
> > rte_counter64_read:
> > mov eax, DWORD PTR [esp+4]
> > mov edx, DWORD PTR [eax+4]
> > mov eax, DWORD PTR [eax]
> > ret
> > rte_counter64_set:
> > movq xmm0, QWORD PTR [esp+8]
> > mov eax, DWORD PTR [esp+4]
> > movq QWORD PTR [eax], xmm0
> > ret
> 
> Sure, atomic might be required on some 32 bit architectures and/or with some 
> compilers.

in theory i think you should be able to use generic atomics and
depending on the target you get codegen that works. it might be
something more expensive on 32-bit and nothing on 64-bit etc..

what's the damage if we just use atomic generic and relaxed ordering? is
the codegen not optimal?
 
> I envision a variety of 32 bit implementations, optimized for certain 
> architectures/compilers.
> 
> Some of them can provide non-tearing 64 bit load/store, so we should also use 
> the counter/offset design pattern for those.
> 


RE: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Wednesday, 22 May 2024 17.33
> 
> On Wed, 22 May 2024 10:31:39 +0200
> Morten Brørup  wrote:
> 
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > + *
> > > + * On native 64 bit platform, counter is implemented as basic
> > > + * 64 bit unsigned integer that only increases.
> > > + */
> > > +typedef struct {
> > > + uint64_t current;
> > > + uint64_t offset;
> > > +} rte_counter64_t;
> >
> > As discussed in the other thread [1], I strongly prefer having
> "current" and "offset" separate, for performance reasons.
> > Keeping each offset close together with its counter will require more
> cache lines than necessary, because the offsets take up space in the hot
> part of a fast path data structure. E.g. the size_bins[] counters could
> fit into one cache line instead of two.
> >
> > [1]:
> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F422@smartserver.smartshare.dk/
> 
> There are no size_bins in the current version of the patch.
> And the number of counters in ethdev part are small so it is less of a
> concern.
> The code is easier to maintain if the counter object is self contained.

I agree that there are advantages to keeping the counter object self contained.

However, these counters are generic, so we cannot assume that there are only 
very few, based on how the current software device drivers use them.

Someone might want to add size_bins to the software device drivers.
And someone else might want to collect many counters in some application or 
library structure.

> 
> > > +
> > > +/* On 32 bit platform, need to use atomic to avoid load/store
> tearing */
> > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> >
> > As shown by Godbolt experiments discussed in a previous thread [2],
> non-tearing 64 bit counters can be implemented without using atomic
> instructions on all 32 bit architectures supported by DPDK. So we should
> use the counter/offset design pattern for RTE_ARCH_32 too.
> >
> > [2]:
> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> 
> Bruce found some 32 bit versions of x86 have the problem.
> Godbolt doesn't seem to list a 32 bit x86 compiler for GCC.
> If you try MSVC in 32 bit mode it will split the loads.
> I see no problem on ARM which is the only other 32 bit we care about.
> 

Yeah, there seems to be a lot of work in testing compiler behavior here. Let's 
start with a generic 32 bit implementation based on atomics, which should work 
correctly on all architectures/compilers.
Architecture- and compiler-specific optimized variants can be added by 
interested CPU vendors later.
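Such a generic fallback might look like the following C11 <stdatomic.h> sketch. The names mirror the patch's API but this is only an illustration of the approach, not the proposed implementation; note the add uses an atomic RMW, which is correct everywhere but may cost a locked instruction on x86:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Generic 32-bit-safe counter: every access is a relaxed atomic, so
 * 64 bit loads/stores cannot tear on any conforming implementation.
 * On some 32 bit targets this may fall back to a lock. */
typedef _Atomic uint64_t generic_counter64_t;

static inline void
generic_counter64_add(generic_counter64_t *counter, uint32_t val)
{
    /* Under the single-writer assumption, a relaxed load + store pair
     * would also be correct and may generate better code than this RMW. */
    atomic_fetch_add_explicit(counter, val, memory_order_relaxed);
}

static inline uint64_t
generic_counter64_read(const generic_counter64_t *counter)
{
    /* C11 atomic_load_explicit takes a non-const pointer, hence the cast. */
    return atomic_load_explicit((generic_counter64_t *)counter,
                                memory_order_relaxed);
}

static inline void
generic_counter64_reset(generic_counter64_t *counter)
{
    atomic_store_explicit(counter, 0, memory_order_relaxed);
}
```

A zeroing reset like this one is only safe if the writer is quiesced; the counter/offset pattern discussed elsewhere in the thread avoids that restriction.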



RE: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Wednesday, 22 May 2024 17.38
> 
> On Wed, 22 May 2024 10:31:39 +0200
> Morten Brørup  wrote:
> 
> > > +/* On 32 bit platform, need to use atomic to avoid load/store
> tearing */
> > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> >
> > As shown by Godbolt experiments discussed in a previous thread [2],
> non-tearing 64 bit counters can be implemented without using atomic
> instructions on all 32 bit architectures supported by DPDK. So we should
> use the counter/offset design pattern for RTE_ARCH_32 too.
> >
> > [2]:
> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> 
> 
> This code built with -O3 and -m32 on godbolt shows split problem.
> 
> #include <stdint.h>
> 
> typedef uint64_t rte_counter64_t;
> 
> void
> rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> {
>   *counter += val;
> }
> … *counter = val;
> }
> 
> rte_counter64_add:
> push ebx
> mov eax, DWORD PTR [esp+8]
> xor ebx, ebx
> mov ecx, DWORD PTR [esp+12]
> add DWORD PTR [eax], ecx
> adc DWORD PTR [eax+4], ebx
> pop ebx
> ret
> 
> rte_counter64_read:
> mov eax, DWORD PTR [esp+4]
> mov edx, DWORD PTR [eax+4]
> mov eax, DWORD PTR [eax]
> ret
> rte_counter64_set:
> movq xmm0, QWORD PTR [esp+8]
> mov eax, DWORD PTR [esp+4]
> movq QWORD PTR [eax], xmm0
> ret

Sure, atomic might be required on some 32 bit architectures and/or with some 
compilers.

I envision a variety of 32 bit implementations, optimized for certain 
architectures/compilers.

Some of them can provide non-tearing 64 bit load/store, so we should also use 
the counter/offset design pattern for those.



Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Stephen Hemminger
On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup  wrote:

> > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> 
> As shown by Godbolt experiments discussed in a previous thread [2], 
> non-tearing 64 bit counters can be implemented without using atomic 
> instructions on all 32 bit architectures supported by DPDK. So we should use 
> the counter/offset design pattern for RTE_ARCH_32 too.
> 
> [2]: 
> https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/


This code built with -O3 and -m32 on godbolt shows split problem.

#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
*counter += val;
}
…   *counter = val;
}

rte_counter64_add:
push ebx
mov eax, DWORD PTR [esp+8]
xor ebx, ebx
mov ecx, DWORD PTR [esp+12]
add DWORD PTR [eax], ecx
adc DWORD PTR [eax+4], ebx
pop ebx
ret

rte_counter64_read:
mov eax, DWORD PTR [esp+4]
mov edx, DWORD PTR [eax+4]
mov eax, DWORD PTR [eax]
ret
rte_counter64_set:
movq xmm0, QWORD PTR [esp+8]
mov eax, DWORD PTR [esp+4]
movq QWORD PTR [eax], xmm0
ret


Re: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Stephen Hemminger
On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup  wrote:

> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice.
> > + *
> > + * On native 64 bit platform, counter is implemented as basic
> > + * 64 bit unsigned integer that only increases.
> > + */
> > +typedef struct {
> > +   uint64_t current;
> > +   uint64_t offset;
> > +} rte_counter64_t;  
> 
> As discussed in the other thread [1], I strongly prefer having "current" and 
> "offset" separate, for performance reasons.
> Keeping each offset close together with its counter will require more cache 
> lines than necessary, because the offsets take up space in the hot part of a 
> fast path data structure. E.g. the size_bins[] counters could fit into one 
> cache line instead of two.
> 
> [1]: 
> https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

There are no size_bins in the current version of the patch.
And the number of counters in ethdev part are small so it is less of a concern.
The code is easier to maintain if the counter object is self contained.

> > + * @internal
> > + * Macro to implement write once (compiler barrier) using stdatomic.
> > + * This is compiler barrier only.
> > + */
> > +#define __rte_write_once(var, val) \
> > +   rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
> > +   rte_memory_order_release)  
> 
> These macros certainly make the code using them shorter.
> But IMHO, they don't improve the readability of the code using them; quite 
> the opposite. Reviewing code using atomics is hard, so I prefer having the 
> memory order directly shown in the code, not hidden behind a macro.

Agree, was going to drop them in next version.

> > +__rte_experimental
> > +static inline uint64_t
> > +rte_counter64_read(const rte_counter64_t *counter)
> > +{
> > +   return __rte_read_once(counter->current) - __rte_read_once(counter-  
> > >offset);  
> 
> I'm not sure that "current" needs to be read using rte_memory_order_consume 
> here; I think rte_memory_order_consume for "offset" and 
> rte_memory_order_relaxed (or perhaps just volatile) for "counter" suffices. 
> But let's settle on the high level design before we start micro-optimizing. 
> :-)

memory order consume is compiler barrier only. Was trying to choose what was 
best here.

> > +
> > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> 
> As shown by Godbolt experiments discussed in a previous thread [2], 
> non-tearing 64 bit counters can be implemented without using atomic 
> instructions on all 32 bit architectures supported by DPDK. So we should use 
> the counter/offset design pattern for RTE_ARCH_32 too.
> 
> [2]: 
> https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

Bruce found some 32 bit versions of x86 have the problem.
Godbolt doesn't seem to list a 32 bit x86 compiler for GCC.
If you try MSVC in 32 bit mode it will split the loads.
I see no problem on ARM which is the only other 32 bit we care about.




RE: [PATCH v9 1/8] eal: generic 64 bit counter

2024-05-22 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Tuesday, 21 May 2024 22.17
> 
> This header implements 64 bit counters using atomic
> operations but with a weak memory ordering so that
> they are safe against load/store splits on 32 bit platforms.
> 
> Signed-off-by: Stephen Hemminger 
> Acked-by: Morten Brørup 
> ---
>  lib/eal/include/meson.build   |   1 +
>  lib/eal/include/rte_counter.h | 150 ++
>  2 files changed, 151 insertions(+)
>  create mode 100644 lib/eal/include/rte_counter.h
> 
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..c070dd0079 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -12,6 +12,7 @@ headers += files(
>  'rte_class.h',
>  'rte_common.h',
>  'rte_compat.h',
> +'rte_counter.h',
>  'rte_debug.h',
>  'rte_dev.h',
>  'rte_devargs.h',
> diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
> new file mode 100644
> index 00..f0c2b71a6c
> --- /dev/null
> +++ b/lib/eal/include/rte_counter.h
> @@ -0,0 +1,150 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) Stephen Hemminger 
> + */
> +
> +#ifndef _RTE_COUNTER_H_
> +#define _RTE_COUNTER_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * @file
> + * RTE Counter
> + *
> + * A counter is a 64 bit value that is safe from split reads/writes.
> + * It assumes that only one CPU at a time will update the counter,
> + * and another CPU may want to read it.
> + *
> + * This is a weaker subset of full atomic variables.
> + *
> + * The counters are subject to the restrictions of atomic variables
> + * in packed or unaligned structures.
> + */
> +
> +#ifdef RTE_ARCH_64
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * On native 64 bit platform, counter is implemented as basic
> + * 64 bit unsigned integer that only increases.
> + */
> +typedef struct {
> + uint64_t current;
> + uint64_t offset;
> +} rte_counter64_t;

As discussed in the other thread [1], I strongly prefer having "current" and 
"offset" separate, for performance reasons.
Keeping each offset close together with its counter will require more cache 
lines than necessary, because the offsets take up space in the hot part of a 
fast path data structure. E.g. the size_bins[] counters could fit into one 
cache line instead of two.

[1]: 
https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

The disadvantages of a slightly larger API are insignificant, compared to the 
performance cost of not separating them.

> +
> +/**
> + * @internal
> + * Macro to implement read once (compiler barrier) using stdatomic.
> + * This is compiler barrier only.
> + */
> +#define __rte_read_once(var) \
> + rte_atomic_load_explicit((__rte_atomic typeof(&(var)))&(var),   \
> + rte_memory_order_consume)
> +
> +/**
> + * @internal
> + * Macro to implement write once (compiler barrier) using stdatomic.
> + * This is compiler barrier only.
> + */
> +#define __rte_write_once(var, val)   \
> + rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
> + rte_memory_order_release)

These macros certainly make the code using them shorter.
But IMHO, they don't improve the readability of the code using them; quite the 
opposite. Reviewing code using atomics is hard, so I prefer having the memory 
order directly shown in the code, not hidden behind a macro.

> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Add value to counter.
> + * Assumes this operation is only done by one thread on the object.
> + *
> + * @param counter
> + *A pointer to the counter.
> + * @param val
> + *The value to add to the counter.
> + */
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> + counter->current += val;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read a counter.
> + * This operation can be done by any thread.
> + *
> + * @param counter
> + *A pointer to the counter.
> + * @return
> + *  The current value of the counter.
> + */
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_read(const rte_counter64_t *counter)
> +{
> +   return __rte_read_once(counter->current) - __rte_read_once(counter->offset);

I'm not sure that "current" needs to be read using rte_memory_order_consume 
here; I think rte_memory_order_consume for "offset" and 
rte_memory_order_relaxed (or perhaps just volatile) for "counter" suffices. But 
let's settle on the high level design before we start micro-optimizing. :-)

> +}
> +
> +/**
> + * @warning
> + * @b EXP