RE: [PATCH v6 1/9] eal: generic 64 bit counter
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Monday, 20 May 2024 00.49
>
> On Sun, 19 May 2024 19:10:30 +0200
> Morten Brørup wrote:
>
> > Absolutely; whenever possible, local counters should be maintained
> > inside the loop, and added to the public counters at the end of a loop.
> >
> > Please note that application counters might be spread all over memory.
> > E.g. per-flow in a flow structure, per QoS class in a QoS class
> > structure, per subscriber in a subscriber structure, etc. And a burst
> > of packets might touch multiple of these. My point is: Atomic
> > read-modify-write of counters will cause serious stalling, waiting for
> > memory access
>
> If an application needs to keep up at DPDK possible speeds, then it
> needs to worry about its cache access patterns. Last time I checked,
> handling 10G 64 byte packets at line rate without loss means a maximum
> of 2 cache misses. Very hard to do with any non-trivial application.

Yes, very hard. Which is why I insist that counters must have the absolutely
highest possible performance in the fast path.

Non-trivial applications are likely to maintain many instances of
application-specific counters.

I consider this patch in the series to be the EAL library's generic 64 bit
counters, intended for application use too. So I am reviewing with a much
broader perspective than just SW drivers. The SW drivers' use of these
counters is not only an improvement of those drivers; it is also an excellent
reference use case showing how to use this new EAL 64 bit counter library.

> Also, SW QoS above 1G is very hard to do with modern CPUs, especially
> with multiple flows. It may be possible with something like FQ CoDel,
> which only keeps a small amount of state.

Yep. Designing software for high performance is all about using optimal
algorithms! :-)
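[Editor's illustration] The advice quoted above - maintain local counters
inside the burst loop and fold them into the public counters once per burst -
can be sketched as follows. The struct and function names are hypothetical,
and a plain uint64_t stands in for the proposed counter type:

```c
#include <stddef.h>
#include <stdint.h>

/* Public counters; another thread may read these. */
struct port_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
};

/* Hypothetical packet descriptor, for illustration only. */
struct pkt {
	uint32_t len;
};

/* Accumulate into locals (likely kept in registers), then update the
 * public counters once per burst instead of once per packet. */
static void
process_burst(struct port_stats *stats, struct pkt *const pkts[], size_t n)
{
	uint64_t bytes = 0;

	for (size_t i = 0; i < n; i++)
		bytes += pkts[i]->len;

	/* One update per counter per burst. */
	stats->rx_packets += n;
	stats->rx_bytes += bytes;
}
```

This keeps the per-packet work out of shared cache lines; only the final
per-burst update touches the publicly visible counters.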
Re: [PATCH v6 1/9] eal: generic 64 bit counter
On Sun, 19 May 2024 19:10:30 +0200
Morten Brørup wrote:

> Absolutely; whenever possible, local counters should be maintained inside
> the loop, and added to the public counters at the end of a loop.
>
> Please note that application counters might be spread all over memory.
> E.g. per-flow in a flow structure, per QoS class in a QoS class structure,
> per subscriber in a subscriber structure, etc. And a burst of packets
> might touch multiple of these. My point is: Atomic read-modify-write of
> counters will cause serious stalling, waiting for memory access

If an application needs to keep up at DPDK possible speeds, then it needs to
worry about its cache access patterns. Last time I checked, handling 10G
64 byte packets at line rate without loss means a maximum of 2 cache misses.
Very hard to do with any non-trivial application.

Also, SW QoS above 1G is very hard to do with modern CPUs, especially with
multiple flows. It may be possible with something like FQ CoDel, which only
keeps a small amount of state.
RE: [PATCH v6 1/9] eal: generic 64 bit counter
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Sunday, 19 May 2024 17.14
>
> On Sat, 18 May 2024 16:00:55 +0200
> Morten Brørup wrote:
>
> > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > Sent: Friday, 17 May 2024 18.18
> > >
> > > On Fri, 17 May 2024 08:44:42 +0200
> > > Morten Brørup wrote:
> > >
> > > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > > Sent: Friday, 17 May 2024 06.27
> > > > >
> > > > > + Bruce for feedback on x86 architecture
> > > > >
> > > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger wrote:
> > > > > >
> > > > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > > > Honnappa Nagarahalli wrote:
> > > > > >
> > > > > > > > + * A counter is 64 bit value that is safe from split read/write
> > > > > > > > + * on 32 bit platforms. It assumes that only one cpu at a time
> > > > > > > If we are defining the counter in this manner, then the
> > > > > > > implementation cannot be generic. I think architectures will
> > > > > > > have constraints if they have to ensure the 64b variables are
> > > > > > > not split.
> > > > > > >
> > > > > > > I think we at least need the counter to be aligned on an 8B
> > > > > > > boundary to have generic code.
> > > > > >
> > > > > > The C standard has always guaranteed that read and write to
> > > > > > unsigned long will not be split.
> > > > > As I understand, this is true only if the variable is an atomic
> > > > > variable. If not, there is no difference between atomic variables
> > > > > and non-atomic variables.
> > > > >
> > > > > > Therefore if arch is 64 bit native there is no need for atomics
> > > > > At least on the ARM architecture, if the variable is not aligned
> > > > > on an 8B boundary, the load or store is not atomic. I am sure it
> > > > > is the same on other architectures.
> > >
> > > After reading this: Who's afraid of a big bad optimizing compiler?
> > > https://lwn.net/Articles/793253/
> >
> > Very interesting article!
> >
> > > Looks like you are right, and atomic or read/write once is required.
> >
> > I don't like the performance tradeoff (for 64 bit architectures) in the
> > v7 patch.
> > For single-thread updated counters, we MUST have a high performance
> > counter_add(), not using atomic read-modify-write.
>
> The fundamental issue is that for SW drivers, having a totally safe reset
> function requires an atomic operation in the fast path for increment.
> Otherwise the following (highly unlikely) race is possible:
>
> CPU A                          CPU B
> load counter (value = X)
>                                store counter = 0
> store counter (value = X + 1)

Yes, this is why I suggest the "offset" method, option 3.

Depending on which CPU wins the race, reset() will set "offset" to either X
or X + 1, using your example here. "offset" is atomic, so reading it cannot
race writing it. And thus: if counter = X was visible at reset(), fetch()
will return counter - offset = X - X = 0, and if counter = X + 1 was visible
at reset(), fetch() will return counter - offset = (X + 1) - (X + 1) = 0.

> > IMO calling counter_fetch() MUST be possible from another thread.
> > This requires that the fast path thread stores the counter atomically
> > (or using volatile), to ensure that the counter is not only kept in a
> > CPU register, but stored in memory visible by other threads.
> >
> > For counter_reset(), we have multiple options:
> > 0. Don't support counter resetting. Not really an option?
>
> We could reject it in the SW drivers. But better not to.

Agree.

> > 1. Only allow calling counter_reset() in the fast path thread updating
> > the counter. This introduces no further requirements.
>
> Not a good restriction

Not good, but the alternatives seem to be worse.

> > 2. Allow calling counter_reset() from another thread, thereby zeroing
> > the "counter" variable. This introduces a requirement for the "counter"
> > to be thread-safe, so counter_add() must atomically read-modify-write
> > the counter, which has a performance cost.

Bad for performance.

> > 3. Allow calling counter_reset() from another thread, and introduce an
> > "offset" variable for counter_fetch() and counter_reset() to provide
> > thread-safe fetch/reset from other threads, using the consume-release
> > pattern.
>
> Too confusing

Using offsets for pseudo-reset is a common design pattern, and the
performance is excellent. Perhaps the Linux kernel doesn't use this design
pattern, but that is not a valid reason to disqualify it.

> > I don't like option 2.
> > I consider counters too important and frequently used in the fast path,
> > to compromise on performance for counters.
>
> Agree.

> > For counters updated by multiple fast path threads,
> > atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.
>
> Not at all worried about overhead in slow (fetch) path.

Agree.
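[Editor's illustration] The "offset" method (option 3) discussed above can be
sketched roughly as below. This is an illustrative model using C11 atomics
and made-up names, not the proposed rte_counter64 API; it assumes a single
fast-path writer and ignores the 32 bit split-write concern:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Option 3 sketch: the fast path only ever adds to "counter"; reset()
 * records the currently visible value in "offset"; fetch() reports the
 * difference, i.e. the count since the last pseudo-reset. */
struct counter64 {
	uint64_t counter;         /* updated by one fast-path thread only */
	_Atomic uint64_t offset;  /* updated by the control-path thread */
};

static inline void
counter64_add(struct counter64 *c, uint32_t val)
{
	/* Single writer: plain add, no atomic read-modify-write.
	 * (A 32 bit platform would additionally need a non-tearing
	 * store; omitted in this sketch.) */
	c->counter += val;
}

static inline uint64_t
counter64_fetch(struct counter64 *c)
{
	/* Count since the last pseudo-reset. */
	return c->counter -
	    atomic_load_explicit(&c->offset, memory_order_acquire);
}

static inline void
counter64_reset(struct counter64 *c)
{
	/* Pseudo-reset: make the current value the new zero point. */
	atomic_store_explicit(&c->offset, c->counter, memory_order_release);
}
```

Whichever value of "counter" was visible when counter64_reset() ran becomes
the zero point, which is why the load/store race described above cannot make
a fetch immediately after reset return nonzero.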
Re: [PATCH v6 1/9] eal: generic 64 bit counter
On Sat, 18 May 2024 16:00:55 +0200
Morten Brørup wrote:

> > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > Sent: Friday, 17 May 2024 18.18
> >
> > On Fri, 17 May 2024 08:44:42 +0200
> > Morten Brørup wrote:
> >
> > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > Sent: Friday, 17 May 2024 06.27
> > > >
> > > > + Bruce for feedback on x86 architecture
> > > >
> > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger wrote:
> > > > >
> > > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > > Honnappa Nagarahalli wrote:
> > > > >
> > > > > > > + * A counter is 64 bit value that is safe from split read/write
> > > > > > > + * on 32 bit platforms. It assumes that only one cpu at a time
> > > > > > If we are defining the counter in this manner, then the
> > > > > > implementation cannot be generic. I think architectures will
> > > > > > have constraints if they have to ensure the 64b variables are
> > > > > > not split.
> > > > > >
> > > > > > I think we at least need the counter to be aligned on an 8B
> > > > > > boundary to have generic code.
> > > > >
> > > > > The C standard has always guaranteed that read and write to
> > > > > unsigned long will not be split.
> > > > As I understand, this is true only if the variable is an atomic
> > > > variable. If not, there is no difference between atomic variables
> > > > and non-atomic variables.
> > > >
> > > > > Therefore if arch is 64 bit native there is no need for atomics
> > > > At least on the ARM architecture, if the variable is not aligned on
> > > > an 8B boundary, the load or store is not atomic. I am sure it is the
> > > > same on other architectures.
> >
> > After reading this: Who's afraid of a big bad optimizing compiler?
> > https://lwn.net/Articles/793253/
>
> Very interesting article!
>
> > Looks like you are right, and atomic or read/write once is required.
>
> I don't like the performance tradeoff (for 64 bit architectures) in the
> v7 patch.
> For single-thread updated counters, we MUST have a high performance
> counter_add(), not using atomic read-modify-write.

The fundamental issue is that for SW drivers, having a totally safe reset
function requires an atomic operation in the fast path for increment.
Otherwise the following (highly unlikely) race is possible:

CPU A                          CPU B
load counter (value = X)
                               store counter = 0
store counter (value = X + 1)

> IMO calling counter_fetch() MUST be possible from another thread.
> This requires that the fast path thread stores the counter atomically (or
> using volatile), to ensure that the counter is not only kept in a CPU
> register, but stored in memory visible by other threads.
>
> For counter_reset(), we have multiple options:
> 0. Don't support counter resetting. Not really an option?

We could reject it in the SW drivers. But better not to.

> 1. Only allow calling counter_reset() in the fast path thread updating
> the counter. This introduces no further requirements.

Not a good restriction

> 2. Allow calling counter_reset() from another thread, thereby zeroing the
> "counter" variable. This introduces a requirement for the "counter" to be
> thread-safe, so counter_add() must atomically read-modify-write the
> counter, which has a performance cost.
> 3. Allow calling counter_reset() from another thread, and introduce an
> "offset" variable for counter_fetch() and counter_reset() to provide
> thread-safe fetch/reset from other threads, using the consume-release
> pattern.

Too confusing

> I don't like option 2.
> I consider counters too important and frequently used in the fast path,
> to compromise on performance for counters.

Agree.

> For counters updated by multiple fast path threads,
> atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.

Not at all worried about overhead in slow (fetch) path.

> > Perhaps introducing rte_read_once and rte_write_once is good idea?
> > Several drivers are implementing it already.
>
> The read_once/write_once are easier to use, but they lack the flexibility
> (regarding barriers and locking) provided by their atomic_explicit
> alternatives, which will impact performance in some use cases.

They solve the compiler reordering problem but do nothing about CPU
ordering. Also, there is no such increment.

> We should strive for the highest possible performance, which means that
> we shouldn't introduce functions or design patterns preventing this.
> Please note: Security vs. performance is another matter - we certainly
> don't want to promote insecure code for the benefit of performance. But
> for "ease of coding" vs. performance, I prefer performance.
>
> That said, I agree that introducing rte_read/write_once functions for use
> by drivers to access hardware makes sense, to eliminate copy-paste
> variants in drivers.
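[Editor's illustration] For reference, Linux-kernel-style once-accessors look
roughly like this. The names mirror the suggested rte_read_once and
rte_write_once but are sketched here with plain volatile casts (and the
GCC/Clang __typeof__ extension), not an existing DPDK API. As noted above,
they only constrain the compiler - preventing caching, tearing, and compiler
reordering of the access - and add no CPU memory barriers, and there is no
once-style increment primitive:

```c
#include <stdint.h>

/* Force exactly one compiler-visible load/store of x. No CPU barrier. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Single-writer counter update whose result is guaranteed to reach
 * memory (not just a register), so a reader thread can observe it. */
static inline void
counter_add_once(uint64_t *counter, uint32_t val)
{
	WRITE_ONCE(*counter, READ_ONCE(*counter) + val);
}
```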
RE: [PATCH v6 1/9] eal: generic 64 bit counter
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Friday, 17 May 2024 18.18
>
> On Fri, 17 May 2024 08:44:42 +0200
> Morten Brørup wrote:
>
> > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > Sent: Friday, 17 May 2024 06.27
> > >
> > > + Bruce for feedback on x86 architecture
> > >
> > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger wrote:
> > > >
> > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > Honnappa Nagarahalli wrote:
> > > >
> > > > > > + * A counter is 64 bit value that is safe from split read/write
> > > > > > + * on 32 bit platforms. It assumes that only one cpu at a time
> > > > > If we are defining the counter in this manner, then the
> > > > > implementation cannot be generic. I think architectures will have
> > > > > constraints if they have to ensure the 64b variables are not split.
> > > > >
> > > > > I think we at least need the counter to be aligned on an 8B
> > > > > boundary to have generic code.
> > > >
> > > > The C standard has always guaranteed that read and write to unsigned
> > > > long will not be split.
> > > As I understand, this is true only if the variable is an atomic
> > > variable. If not, there is no difference between atomic variables and
> > > non-atomic variables.
> > >
> > > > Therefore if arch is 64 bit native there is no need for atomics
> > > At least on the ARM architecture, if the variable is not aligned on an
> > > 8B boundary, the load or store is not atomic. I am sure it is the same
> > > on other architectures.
>
> After reading this: Who's afraid of a big bad optimizing compiler?
> https://lwn.net/Articles/793253/

Very interesting article!

> Looks like you are right, and atomic or read/write once is required.

I don't like the performance tradeoff (for 64 bit architectures) in the v7
patch.

For single-thread updated counters, we MUST have a high performance
counter_add(), not using atomic read-modify-write.

IMO calling counter_fetch() MUST be possible from another thread.
This requires that the fast path thread stores the counter atomically (or
using volatile), to ensure that the counter is not only kept in a CPU
register, but stored in memory visible by other threads.

For counter_reset(), we have multiple options:
0. Don't support counter resetting. Not really an option?
1. Only allow calling counter_reset() in the fast path thread updating the
counter. This introduces no further requirements.
2. Allow calling counter_reset() from another thread, thereby zeroing the
"counter" variable. This introduces a requirement for the "counter" to be
thread-safe, so counter_add() must atomically read-modify-write the counter,
which has a performance cost.
3. Allow calling counter_reset() from another thread, and introduce an
"offset" variable for counter_fetch() and counter_reset() to provide
thread-safe fetch/reset from other threads, using the consume-release
pattern.

I don't like option 2.
I consider counters too important and frequently used in the fast path, to
compromise on performance for counters.

For counters updated by multiple fast path threads,
atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.

> Perhaps introducing rte_read_once and rte_write_once is good idea?
> Several drivers are implementing it already.

The read_once/write_once are easier to use, but they lack the flexibility
(regarding barriers and locking) provided by their atomic_explicit
alternatives, which will impact performance in some use cases.

We should strive for the highest possible performance, which means that we
shouldn't introduce functions or design patterns preventing this.

Please note: Security vs. performance is another matter - we certainly don't
want to promote insecure code for the benefit of performance. But for "ease
of coding" vs. performance, I prefer performance.

That said, I agree that introducing rte_read/write_once functions for use by
drivers to access hardware makes sense, to eliminate copy-paste variants in
drivers.
But how can we prevent such functions from being used for other purposes,
where atomic_explicit should be used?
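[Editor's illustration] For the multi-writer case mentioned above, a relaxed
atomic fetch-add is the cheapest correct update: only atomicity is required
in the fast path, not ordering. A sketch with C11 atomics, using illustrative
names rather than the proposed API:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A counter shared by several fast-path threads. */
static _Atomic uint64_t shared_drops;

static inline void
drops_add(uint64_t n)
{
	/* Relaxed: atomic, but imposes no memory ordering. */
	atomic_fetch_add_explicit(&shared_drops, n, memory_order_relaxed);
}

static inline uint64_t
drops_fetch(void)
{
	/* The slow (fetch) path can afford stronger ordering. */
	return atomic_load_explicit(&shared_drops, memory_order_acquire);
}
```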
RE: [PATCH v6 1/9] eal: generic 64 bit counter
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Friday, 17 May 2024 06.27
>
> + Bruce for feedback on x86 architecture
>
> > On May 16, 2024, at 10:30 PM, Stephen Hemminger wrote:
> >
> > On Fri, 17 May 2024 02:45:12 +0000
> > Honnappa Nagarahalli wrote:
> >
> > > > + * A counter is 64 bit value that is safe from split read/write
> > > > + * on 32 bit platforms. It assumes that only one cpu at a time
> > > If we are defining the counter in this manner, then the implementation
> > > cannot be generic. I think architectures will have constraints if they
> > > have to ensure the 64b variables are not split.
> > >
> > > I think we at least need the counter to be aligned on an 8B boundary
> > > to have generic code.
> >
> > The C standard has always guaranteed that read and write to unsigned
> > long will not be split.
> As I understand, this is true only if the variable is an atomic variable.
> If not, there is no difference between atomic variables and non-atomic
> variables.
>
> > Therefore if arch is 64 bit native there is no need for atomics
> At least on the ARM architecture, if the variable is not aligned on an 8B
> boundary, the load or store is not atomic. I am sure it is the same on
> other architectures.

I guess it depends on the architecture's natural alignment size and the
compiler - especially on 32 bit architectures, where the natural alignment
size is 4 bytes.

We could play it safe and add alignment to the counter type:

#include <stdalign.h>

#ifdef RTE_ARCH_64
#if alignof(uint64_t) < sizeof(uint64_t)
typedef alignas(8) uint64_t rte_counter64_t;
#else
typedef uint64_t rte_counter64_t;
#endif
#else
#if alignof(RTE_ATOMIC(uint64_t)) < sizeof(uint64_t)
typedef alignas(8) RTE_ATOMIC(uint64_t) rte_counter64_t;
#else
typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
#endif
#endif

The above is intentionally verbose, because some 32 bit architectures can
implement 64 bit non-tearing counters without atomics [1], and the new 64 bit
counter library should be prepared for accepting such optimizations.

[1]: https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

Picking up on another branch of this discussion: This 64 bit counter library
is *not* an internal API. Applications should use it for their counters too.

> Bruce, any comments for x86?
Re: [PATCH v6 1/9] eal: generic 64 bit counter
+ Bruce for feedback on x86 architecture

> On May 16, 2024, at 10:30 PM, Stephen Hemminger wrote:
>
> On Fri, 17 May 2024 02:45:12 +0000
> Honnappa Nagarahalli wrote:
>
> > > + * A counter is 64 bit value that is safe from split read/write
> > > + * on 32 bit platforms. It assumes that only one cpu at a time
> > If we are defining the counter in this manner, then the implementation
> > cannot be generic. I think architectures will have constraints if they
> > have to ensure the 64b variables are not split.
> >
> > I think we at least need the counter to be aligned on an 8B boundary to
> > have generic code.
>
> The C standard has always guaranteed that read and write to unsigned long
> will not be split.

As I understand, this is true only if the variable is an atomic variable. If
not, there is no difference between atomic variables and non-atomic
variables.

> Therefore if arch is 64 bit native there is no need for atomics

At least on the ARM architecture, if the variable is not aligned on an 8B
boundary, the load or store is not atomic. I am sure it is the same on other
architectures.

Bruce, any comments for x86?
Re: [PATCH v6 1/9] eal: generic 64 bit counter
> On May 16, 2024, at 7:12 PM, Stephen Hemminger wrote:
>
> This header implements 64 bit counters that are NOT atomic
> but are safe against load/store splits on 32 bit platforms.
>
> Signed-off-by: Stephen Hemminger
> Acked-by: Morten Brørup
> ---
> lib/eal/include/meson.build   |  1 +
> lib/eal/include/rte_counter.h | 98 +++
> 2 files changed, 99 insertions(+)
> create mode 100644 lib/eal/include/rte_counter.h
>
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..c070dd0079 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -12,6 +12,7 @@ headers += files(
>         'rte_class.h',
>         'rte_common.h',
>         'rte_compat.h',
> +        'rte_counter.h',
>         'rte_debug.h',
>         'rte_dev.h',
>         'rte_devargs.h',
> diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
> new file mode 100644
> index 00..d623195d63
> --- /dev/null
> +++ b/lib/eal/include/rte_counter.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) Stephen Hemminger
> + */
> +
> +#ifndef _RTE_COUNTER_H_
> +#define _RTE_COUNTER_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file
> + * RTE Counter
> + *
> + * A counter is 64 bit value that is safe from split read/write
> + * on 32 bit platforms. It assumes that only one cpu at a time

If we are defining the counter in this manner, then the implementation cannot
be generic. I think architectures will have constraints if they have to
ensure the 64b variables are not split.

I think we at least need the counter to be aligned on an 8B boundary to have
generic code.

> + * will update the counter, and another CPU may want to read it.
> + *
> + * This is a much weaker guarantee than full atomic variables
> + * but is faster since no locked operations are required for update.
> + */
> +
> +#ifdef RTE_ARCH_64
> +/*
> + * On a platform that can support native 64 bit type, no special handling.
> + * These are just wrappers around a 64 bit value.
> + */
> +typedef uint64_t rte_counter64_t;
> +
> +/**
> + * Add value to counter.
> + */
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> +	*counter += val;
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_fetch(const rte_counter64_t *counter)
> +{
> +	return *counter;
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_set(rte_counter64_t *counter, uint64_t val)
> +{
> +	*counter = val;
> +}
> +
> +#else
> +
> +#include <rte_stdatomic.h>
> +
> +/*
> + * On a 32 bit platform need to use atomic to force the compiler to not
> + * split 64 bit read/write.
> + */
> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> +	rte_atomic_fetch_add_explicit(counter, val, rte_memory_order_relaxed);
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_fetch(const rte_counter64_t *counter)
> +{
> +	return rte_atomic_load_explicit(counter, rte_memory_order_consume);
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_set(rte_counter64_t *counter, uint64_t val)
> +{
> +	rte_atomic_store_explicit(counter, val, rte_memory_order_release);
> +}
> +#endif
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> +	rte_counter64_set(counter, 0);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_COUNTER_H_ */
> --
> 2.43.0
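[Editor's illustration] To see how a SW driver might consume this header,
here is a hedged usage sketch. The 64 bit native implementations from the
patch are copied inline so the example stands alone (a real driver would
#include <rte_counter.h> with its __rte_experimental markers); the queue
stats struct and function names are made up:

```c
#include <stdint.h>

/* Inline copies of the patch's 64 bit native implementations. */
typedef uint64_t rte_counter64_t;

static inline void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
	*counter += val;
}

static inline uint64_t
rte_counter64_fetch(const rte_counter64_t *counter)
{
	return *counter;
}

static inline void
rte_counter64_reset(rte_counter64_t *counter)
{
	*counter = 0;
}

/* Hypothetical per-queue stats, updated by one fast-path thread. */
struct rxq_stats {
	rte_counter64_t packets;
	rte_counter64_t bytes;
};

/* Fast path: one counter update per burst, not per packet. */
static inline void
rxq_count_burst(struct rxq_stats *st, uint16_t nb_pkts, uint32_t nb_bytes)
{
	rte_counter64_add(&st->packets, nb_pkts);
	rte_counter64_add(&st->bytes, nb_bytes);
}
```

The slow path (e.g. ethdev stats_get/stats_reset callbacks) would call
rte_counter64_fetch() and rte_counter64_reset(); whether doing so from
another thread is safe is exactly what the rest of this thread debates.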