RE: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-20 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Monday, 20 May 2024 00.49
> 
> On Sun, 19 May 2024 19:10:30 +0200
> Morten Brørup  wrote:
> 
> > Absolutely; whenever possible, local counters should be maintained
> inside the loop, and added to the public counters at the end of a loop.
> >
> > Please note that application counters might be spread all over memory.
> > E.g. per-flow in a flow structure, per QoS class in a QoS class
> structure, per subscriber in a subscriber structure, etc. And a burst of
> packets might touch multiple of these. My point is: Atomic read-modify-
> write of counters will cause serious stalling, waiting for memory access
> 
> If an application needs to keep up at DPDK possible speeds, then it
> needs to worry about its
> cache access patterns. Last time I checked handling 10G 64 byte packets
> at line rate without loss
> means a maximum of 2 cache misses. Very hard to do with any non trivial
> application.

Yes, very hard.
Which is why I insist that counters must have the absolutely highest possible 
performance in the fast path.
Non-trivial applications are likely to maintain many instances of application 
specific counters.

I consider this patch in the series as the EAL library generic 64 bit counters, 
and for application use too. So I am reviewing with a much broader perspective 
than just SW drivers.
The SW drivers' use of these counters is not only an improvement of those 
drivers, it's also an excellent reference use case showing how to use this new 
EAL 64 bit counters library.

> 
> Also, SW QoS above 1G is very hard to do with modern CPU's. Especially
> with multiple flows.
> It maybe possible with something like FQ Codel which only keeps small
> amount of state.

Yep. Designing software for high performance is all about using optimal 
algorithms! :-)



Re: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-19 Thread Stephen Hemminger
On Sun, 19 May 2024 19:10:30 +0200
Morten Brørup  wrote:

> Absolutely; whenever possible, local counters should be maintained inside the 
> loop, and added to the public counters at the end of a loop.
> 
> Please note that application counters might be spread all over memory.
> E.g. per-flow in a flow structure, per QoS class in a QoS class structure, 
> per subscriber in a subscriber structure, etc. And a burst of packets might 
> touch multiple of these. My point is: Atomic read-modify-write of counters 
> will cause serious stalling, waiting for memory access

If an application needs to keep up at DPDK possible speeds, then it needs to 
worry about its
cache access patterns. Last time I checked handling 10G 64 byte packets at line 
rate without loss
means a maximum of 2 cache misses. Very hard to do with any non trivial 
application.

Also, SW QoS above 1G is very hard to do with modern CPU's. Especially with 
multiple flows.
It maybe possible with something like FQ Codel which only keeps small amount of 
state.


RE: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-19 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Sunday, 19 May 2024 17.14
> 
> On Sat, 18 May 2024 16:00:55 +0200
> Morten Brørup  wrote:
> 
> > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > Sent: Friday, 17 May 2024 18.18
> > >
> > > On Fri, 17 May 2024 08:44:42 +0200
> > > Morten Brørup  wrote:
> > >
> > > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > > Sent: Friday, 17 May 2024 06.27
> > > > >
> > > > > + Bruce for feedback on x86 architecture
> > > > >
> > > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger
> > > 
> > > > > wrote:
> > > > > >
> > > > > > On Fri, 17 May 2024 02:45:12 +
> > > > > > Honnappa Nagarahalli  wrote:
> > > > > >
> > > > > >>> + * A counter is 64 bit value that is safe from split
> read/write
> > > > > >>> + * on 32 bit platforms. It assumes that only one cpu at a
> time
> > > > > >> If we are defining the counter in this manner, then
> > > implementation cannot
> > > > > be generic. I think architectures will have constraints if they
> have
> > > to ensure
> > > > > the 64b variables are not split.
> > > > > >>
> > > > > >> I think we at least need the counter to be aligned on 8B
> boundary
> > > to have
> > > > > generic code.
> > > > > >
> > > > > > The C standard has always guaranteed that read and write to
> > > unsigned log
> > > > > will not be split.
> > > > > As I understand, this is true only if the variable is an atomic
> > > variable. If
> > > > > not, there is no difference between atomic variables and non-
> atomic
> > > variables.
> > > > >
> > > > > > Therefore if arch is 64 bit native there is no need for
> atomics
> > > > > At least on ARM architecture, if the variable is not aligned on
> 8B
> > > boundary,
> > > > > the load or store are not atomic. I am sure it is the same on
> other
> > > > > architectures.
> > >
> > > After reading this: Who's afraid of a big bad optimizing compiler?
> > >  https://lwn.net/Articles/793253/
> >
> > Very interesting article!
> >
> > >
> > > Looks like you are right, and atomic or read/write once is required.
> >
> > I don't like the performance tradeoff (for 64 bit architectures) in
> the v7 patch.
> > For single-tread updated counters, we MUST have a high performance
> counter_add(), not using atomic read-modify-write.
> 
> The fundamental issue is that for SW drivers, having a totally safe
> reset function requires
> an atomic operation in the fast path for increment. Otherwise the
> following (highly unlikely) race
> is possible:
> 
>   CPU A   CPU B
>   load counter (value = X)
>   store counter = 0
>   store counter (value = X + 1)
> 

Yes, this is why I suggest the "offset" method, option 3.
Depending on which CPU wins the race, reset() will set "offset" to either X or 
X + 1 using your example here.
"offset" is atomic, so reading it cannot race writing it. And thus:
if counter = X was visible at reset(), fetch() will return counter - offset = X 
- X = 0, and
if counter = X + 1 was visible at reset(), fetch() will return counter - offset 
= (X + 1) - (X + 1) = 0.

> 
> >
> > IMO calling counter_fetch() MUST be possible from another thread.
> > This requires that the fast path thread stores the counter atomically
> (or using volatile), to ensure that the counter is not only kept in a
> CPU register, but stored in memory visible by other threads.
> >
> > For counter_reset(), we have multiple options:
> > 0. Don't support counter resetting. Not really on option?
> 
> We could reject it in the SW drivers. But better not to.

Agree.

> 
> > 1. Only allow calling counter_reset() in the fast path thread updating
> the counter. This introduces no further requirements.
> 
> Not a good restriction

Not good, but the alternatives seem to be worse.

> 
> > 2. Allow calling counter_reset() from another thread, thereby zeroing
> the "counter" variable. This introduces a requirement for the "counter"
> to be thread-safe, so counter_add() must atomically read-modify-write
> the counter, which has a performance cost.

Bad for performance.

> 
> > 3. Allow calling counter_reset() from another thread, and introduce an
> "offset" variable for counter_fetch() and counter_reset() to provide
> thread-safe fetch/reset from other threads, using the consume-release
> pattern.
> 
> Too confusing

Using offsets for pseudo-reset is a common design pattern.
And the performance is excellent.

Perhaps the Linux kernel doesn't use this design pattern, but that is not a 
valid reason to disqualify it.

> 
> >
> > I don't like option 2.
> > I consider counters too important and frequently used in the fast
> path, to compromise on performance for counters.
> 
> Agree.
> 
> >
> > For counters updated by multiple fast path threads,
> atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.
> 
> Not at all worried about overhead in slow (fetch) path.

Agree. 

Re: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-19 Thread Stephen Hemminger
On Sat, 18 May 2024 16:00:55 +0200
Morten Brørup  wrote:

> > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > Sent: Friday, 17 May 2024 18.18
> > 
> > On Fri, 17 May 2024 08:44:42 +0200
> > Morten Brørup  wrote:
> >   
> > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > Sent: Friday, 17 May 2024 06.27
> > > >
> > > > + Bruce for feedback on x86 architecture
> > > >  
> > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger  
> >   
> > > > wrote:  
> > > > >
> > > > > On Fri, 17 May 2024 02:45:12 +
> > > > > Honnappa Nagarahalli  wrote:
> > > > >  
> > > > >>> + * A counter is 64 bit value that is safe from split read/write
> > > > >>> + * on 32 bit platforms. It assumes that only one cpu at a time  
> > > > >> If we are defining the counter in this manner, then  
> > implementation cannot  
> > > > be generic. I think architectures will have constraints if they have  
> > to ensure  
> > > > the 64b variables are not split.  
> > > > >>
> > > > >> I think we at least need the counter to be aligned on 8B boundary  
> > to have  
> > > > generic code.  
> > > > >
> > > > > The C standard has always guaranteed that read and write to  
> > unsigned log  
> > > > will not be split.
> > > > As I understand, this is true only if the variable is an atomic  
> > variable. If  
> > > > not, there is no difference between atomic variables and non-atomic  
> > variables.  
> > > >  
> > > > > Therefore if arch is 64 bit native there is no need for atomics  
> > > > At least on ARM architecture, if the variable is not aligned on 8B  
> > boundary,  
> > > > the load or store are not atomic. I am sure it is the same on other
> > > > architectures.  
> > 
> > After reading this: Who's afraid of a big bad optimizing compiler?
> >  https://lwn.net/Articles/793253/  
> 
> Very interesting article!
> 
> > 
> > Looks like you are right, and atomic or read/write once is required.  
> 
> I don't like the performance tradeoff (for 64 bit architectures) in the v7 
> patch.
> For single-tread updated counters, we MUST have a high performance 
> counter_add(), not using atomic read-modify-write.

The fundamental issue is that for SW drivers, having a totally safe reset 
function requires
an atomic operation in the fast path for increment. Otherwise the following 
(highly unlikely) race
is possible:

CPU A   CPU B
load counter (value = X)
store counter = 0
store counter (value = X + 1)


> 
> IMO calling counter_fetch() MUST be possible from another thread.
> This requires that the fast path thread stores the counter atomically (or 
> using volatile), to ensure that the counter is not only kept in a CPU 
> register, but stored in memory visible by other threads.
> 
> For counter_reset(), we have multiple options:
> 0. Don't support counter resetting. Not really on option?

We could reject it in the SW drivers. But better not to.

> 1. Only allow calling counter_reset() in the fast path thread updating the 
> counter. This introduces no further requirements.

Not a good restriction

> 2. Allow calling counter_reset() from another thread, thereby zeroing the 
> "counter" variable. This introduces a requirement for the "counter" to be 
> thread-safe, so counter_add() must atomically read-modify-write the counter, 
> which has a performance cost.

> 3. Allow calling counter_reset() from another thread, and introduce an 
> "offset" variable for counter_fetch() and counter_reset() to provide 
> thread-safe fetch/reset from other threads, using the consume-release pattern.

Too confusing

> 
> I don't like option 2.
> I consider counters too important and frequently used in the fast path, to 
> compromise on performance for counters.

Agree.

> 
> For counters updated by multiple fast path threads, 
> atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.

Not at all worried about overhead in slow (fetch) path.

> 
> > Perhaps introducing rte_read_once and rte_write_once is good idea?
> > Several drivers are implementing it already.  
> 
> The read_once/write_once are easier to use, but they lack the flexibility 
> (regarding barriers and locking) provided by their atomic_explicit 
> alternatives, which will impact performance in some use cases.

They solve the compiler reordering problem but do nothing about cpu ordering.
Also, there is no such increment.

> 
> We should strive for the highest possible performance, which means that we 
> shouldn't introduce functions or design patterns preventing this.
> Please note: Security vs. performance is another matter - we certainly don't 
> want to promote insecure code for the benefit of performance. But for "ease 
> of coding" vs. performance, I prefer performance.
> 
> That said, I agree that introducing rte_read/write_once functions for use by 
> drivers to access hardware makes sense, to eliminate 

RE: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-18 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Friday, 17 May 2024 18.18
> 
> On Fri, 17 May 2024 08:44:42 +0200
> Morten Brørup  wrote:
> 
> > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > Sent: Friday, 17 May 2024 06.27
> > >
> > > + Bruce for feedback on x86 architecture
> > >
> > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger
> 
> > > wrote:
> > > >
> > > > On Fri, 17 May 2024 02:45:12 +
> > > > Honnappa Nagarahalli  wrote:
> > > >
> > > >>> + * A counter is 64 bit value that is safe from split read/write
> > > >>> + * on 32 bit platforms. It assumes that only one cpu at a time
> > > >> If we are defining the counter in this manner, then
> implementation cannot
> > > be generic. I think architectures will have constraints if they have
> to ensure
> > > the 64b variables are not split.
> > > >>
> > > >> I think we at least need the counter to be aligned on 8B boundary
> to have
> > > generic code.
> > > >
> > > > The C standard has always guaranteed that read and write to
> unsigned log
> > > will not be split.
> > > As I understand, this is true only if the variable is an atomic
> variable. If
> > > not, there is no difference between atomic variables and non-atomic
> variables.
> > >
> > > > Therefore if arch is 64 bit native there is no need for atomics
> > > At least on ARM architecture, if the variable is not aligned on 8B
> boundary,
> > > the load or store are not atomic. I am sure it is the same on other
> > > architectures.
> 
> After reading this: Who's afraid of a big bad optimizing compiler?
>  https://lwn.net/Articles/793253/

Very interesting article!

> 
> Looks like you are right, and atomic or read/write once is required.

I don't like the performance tradeoff (for 64 bit architectures) in the v7 
patch.
For single-tread updated counters, we MUST have a high performance 
counter_add(), not using atomic read-modify-write.

IMO calling counter_fetch() MUST be possible from another thread.
This requires that the fast path thread stores the counter atomically (or using 
volatile), to ensure that the counter is not only kept in a CPU register, but 
stored in memory visible by other threads.

For counter_reset(), we have multiple options:
0. Don't support counter resetting. Not really on option?
1. Only allow calling counter_reset() in the fast path thread updating the 
counter. This introduces no further requirements.
2. Allow calling counter_reset() from another thread, thereby zeroing the 
"counter" variable. This introduces a requirement for the "counter" to be 
thread-safe, so counter_add() must atomically read-modify-write the counter, 
which has a performance cost.
3. Allow calling counter_reset() from another thread, and introduce an "offset" 
variable for counter_fetch() and counter_reset() to provide thread-safe 
fetch/reset from other threads, using the consume-release pattern.

I don't like option 2.
I consider counters too important and frequently used in the fast path, to 
compromise on performance for counters.

For counters updated by multiple fast path threads, atomic_fetch_add_explicit() 
of the "counter" variable seems unavoidable.

> Perhaps introducing rte_read_once and rte_write_once is good idea?
> Several drivers are implementing it already.

The read_once/write_once are easier to use, but they lack the flexibility 
(regarding barriers and locking) provided by their atomic_explicit 
alternatives, which will impact performance in some use cases.

We should strive for the highest possible performance, which means that we 
shouldn't introduce functions or design patterns preventing this.
Please note: Security vs. performance is another matter - we certainly don't 
want to promote insecure code for the benefit of performance. But for "ease of 
coding" vs. performance, I prefer performance.

That said, I agree that introducing rte_read/write_once functions for use by 
drivers to access hardware makes sense, to eliminate copy-paste variants in 
drivers.
But how can we prevent such functions from being used for other purposes, where 
atomic_explicit should be used?



RE: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-16 Thread Morten Brørup
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Friday, 17 May 2024 06.27
> 
> + Bruce for feedback on x86 architecture
> 
> > On May 16, 2024, at 10:30 PM, Stephen Hemminger 
> wrote:
> >
> > On Fri, 17 May 2024 02:45:12 +
> > Honnappa Nagarahalli  wrote:
> >
> >>> + * A counter is 64 bit value that is safe from split read/write
> >>> + * on 32 bit platforms. It assumes that only one cpu at a time
> >> If we are defining the counter in this manner, then implementation cannot
> be generic. I think architectures will have constraints if they have to ensure
> the 64b variables are not split.
> >>
> >> I think we at least need the counter to be aligned on 8B boundary to have
> generic code.
> >
> > The C standard has always guaranteed that read and write to unsigned log
> will not be split.
> As I understand, this is true only if the variable is an atomic variable. If
> not, there is no difference between atomic variables and non-atomic variables.
> 
> > Therefore if arch is 64 bit native there is no need for atomics
> At least on ARM architecture, if the variable is not aligned on 8B boundary,
> the load or store are not atomic. I am sure it is the same on other
> architectures.

I guess it depends on the architecture's natural alignment size and the 
compiler - especially on 32 bit architectures, where the natural alignment size 
is 4 bytes.

We could play it safe and add alignment to the counter type:

#include 
#ifdef RTE_ARCH_64
#if alignof(uint64_t) < sizeof(uint64_t)
typedef alignas(8) uint64_t rte_counter64_t;
#else
typedef uint64_t rte_counter64_t;
#endif
#else
#if alignof(RTE_ATOMIC(uint64_t)) < sizeof(uint64_t)
typedef alignas(8) RTE_ATOMIC(uint64_t) rte_counter64_t;
#else
typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
#endif
#endif

This above is intentionally verbose, because some 32 bit architectures can 
implement 64 bit non-tearing counters without atomics [1], and the new 64 bit 
counter library should be prepared for accepting such optimizations.
[1]: 
https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/


Picking up on another branch of this discussion:
This 64 bit counter library is *not* an internal API. Applications should use 
it for their counters too.

> Bruce, any comments for x86?



Re: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-16 Thread Honnappa Nagarahalli
+ Bruce for feedback on x86 architecture 

> On May 16, 2024, at 10:30 PM, Stephen Hemminger  
> wrote:
> 
> On Fri, 17 May 2024 02:45:12 +
> Honnappa Nagarahalli  wrote:
> 
>>> + * A counter is 64 bit value that is safe from split read/write
>>> + * on 32 bit platforms. It assumes that only one cpu at a time  
>> If we are defining the counter in this manner, then implementation cannot be 
>> generic. I think architectures will have constraints if they have to ensure 
>> the 64b variables are not split.
>> 
>> I think we at least need the counter to be aligned on 8B boundary to have 
>> generic code.
> 
> The C standard has always guaranteed that read and write to unsigned log will 
> not be split.
As I understand, this is true only if the variable is an atomic variable. If 
not, there is no difference between atomic variables and non-atomic variables.

> Therefore if arch is 64 bit native there is no need for atomics
At least on ARM architecture, if the variable is not aligned on 8B boundary, 
the load or store are not atomic. I am sure it is the same on other 
architectures.
Bruce, any comments for x86?



Re: [PATCH v6 1/9] eal: generic 64 bit counter

2024-05-16 Thread Honnappa Nagarahalli


> On May 16, 2024, at 7:12 PM, Stephen Hemminger  
> wrote:
> 
> This header implements 64 bit counters that are NOT atomic
> but are safe against load/store splits on 32 bit platforms.
> 
> Signed-off-by: Stephen Hemminger 
> Acked-by: Morten Brørup 
> ---
> lib/eal/include/meson.build   |  1 +
> lib/eal/include/rte_counter.h | 98 +++
> 2 files changed, 99 insertions(+)
> create mode 100644 lib/eal/include/rte_counter.h
> 
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..c070dd0079 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -12,6 +12,7 @@ headers += files(
> 'rte_class.h',
> 'rte_common.h',
> 'rte_compat.h',
> +'rte_counter.h',
> 'rte_debug.h',
> 'rte_dev.h',
> 'rte_devargs.h',
> diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
> new file mode 100644
> index 00..d623195d63
> --- /dev/null
> +++ b/lib/eal/include/rte_counter.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) Stephen Hemminger 
> + */
> +
> +#ifndef _RTE_COUNTER_H_
> +#define _RTE_COUNTER_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file
> + * RTE Counter
> + *
> + * A counter is 64 bit value that is safe from split read/write
> + * on 32 bit platforms. It assumes that only one cpu at a time
If we are defining the counter in this manner, then implementation cannot be 
generic. I think architectures will have constraints if they have to ensure the 
64b variables are not split.

I think we at least need the counter to be aligned on 8B boundary to have 
generic code.

> + * will update the counter, and another CPU may want to read it.
> + *
> + * This is a much weaker guarantee than full atomic variables
> + * but is faster since no locked operations are required for update.
> + */
> +
> +#ifdef RTE_ARCH_64
> +/*
> + * On a platform that can support native 64 bit type, no special handling.
> + * These are just wrapper around 64 bit value.
> + */
> +typedef uint64_t rte_counter64_t;
> +
> +/**
> + * Add value to counter.
> + */
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> + *counter += val;
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_fetch(const rte_counter64_t *counter)
> +{
> + return *counter;
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_set(rte_counter64_t *counter, uint64_t val)
> +{
> + *counter = val;
> +}
> +
> +#else
> +
> +#include 
> +
> +/*
> + * On a 32 bit platform need to use atomic to force the compler to not
> + * split 64 bit read/write.
> + */
> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> + rte_atomic_fetch_add_explicit(counter, val, rte_memory_order_relaxed);
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_fetch(const rte_counter64_t *counter)
> +{
> + return rte_atomic_load_explicit(counter, rte_memory_order_consume);
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_set(rte_counter64_t *counter, uint64_t val)
> +{
> + rte_atomic_store_explicit(counter, val, rte_memory_order_release);
> +}
> +#endif
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> + rte_counter64_set(counter, 0);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_COUNTER_H_ */
> -- 
> 2.43.0
>