Re: Optimizing kernel compilation / alignments for network performance

2022-05-10 Thread Rafał Miłecki

On 6.05.2022 11:44, Arnd Bergmann wrote:

On Fri, May 6, 2022 at 10:55 AM Rafał Miłecki  wrote:

On 6.05.2022 10:45, Arnd Bergmann wrote:

On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki  wrote:

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 2 speeds:
284 Mbps / 408 Mbps


Can you try using 'numactl -C' to pin the iperf processes to
a particular CPU core? This may be related to the locality of
the user process relative to where the interrupts end up.


I run iperf on x86 machines connected to the router's WAN and LAN ports.
It's meant to emulate an end user just downloading data from / uploading
data to the Internet.

The router's only task here is doing masquerade NAT.


Ah, makes sense. Can you observe the CPU usage to be on
a particular core in the slow vs fast case then?


With echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 311 Mb/s (CPUs load: 100% + 0%)
b) 408 Mb/s (CPUs load: 100% + 62%)

With echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 290 Mb/s (CPUs load: 100% + 0%)
b) 410 Mb/s (CPUs load: 100% + 63%)

With echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was stable:
a) 372 Mb/s (CPUs load: 100% + 26%)
b) 375 Mb/s (CPUs load: 82% + 100%)

With echo 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus
NAT speed was varying between:
a) 293 Mb/s (CPUs load: 100% + 0%)
b) 332 Mb/s (CPUs load: 100% + 17%)
c) 374 Mb/s (CPUs load: 81% + 100%)
d) 442 Mb/s (CPUs load: 100% + 75%)



After some extra debugging I found the reason for the varying CPU usage &
varying NAT speeds.

My router has a single switch, so I use two VLANs:
eth0.1 - LAN
eth0.2 - WAN
(VLAN traffic is routed to the correct ports by the switch). On top of that I
have a "br-lan" bridge interface bridging eth0.1 and the wireless interfaces.

For all that time I had /sys/class/net/br-lan/queues/rx-0/rps_cpus set
to 3 (the value is a hex bitmask of CPUs: 1 = CPU 0, 2 = CPU 1, 3 = both).
So bridge traffic was randomly handled by CPU 0 or CPU 1.

If I instead assign a specific CPU core to each of the two interfaces, e.g.:
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 2 > /sys/class/net/br-lan/queues/rx-0/rps_cpus
things get stable.

With the above I get a stable 419 Mb/s (CPUs load: 100% + 64%) on every
iperf session.



Re: Optimizing kernel compilation / alignments for network performance

2022-05-10 Thread Rafał Miłecki

On 6.05.2022 10:45, Arnd Bergmann wrote:

On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki  wrote:


On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup


There are a lot of cache management functions here.


Indeed, so optimizing the coherency management (see Felix' reply)
is likely to help most in making the driver faster, but that does not
explain why the alignment of the object code has such a big impact
on performance.

To investigate the alignment further, what I was actually looking for
is a comparison of the profile of the slow and fast case. Here I would
expect that the slow case spends more time in one of the functions
that don't deal with cache management (maybe fib_table_lookup or
__netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation, maybe
   some of the calls can be turned into a relaxed read, like the readback
   in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
   though obviously not the one in bgmac_dma_rx_read().
   It may be possible to even avoid some of the reads entirely; checking
   for more data in bgmac_poll() may actually be counterproductive
   depending on the workload.


I'll experiment with that, hopefully I can optimize it a bit.
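
A minimal sketch of what such a relaxed readback could look like;
bgmac_read_relaxed() is a hypothetical helper (nothing like it exists in
the driver today) and plat.base is only valid on the platform (non-bcma)
variant:

static inline u32 bgmac_read_relaxed(struct bgmac *bgmac, u16 offset)
{
	/* readl_relaxed() skips the DMA-ordering barrier implied by
	 * readl(); that is enough for a readback whose only job is to
	 * flush a posted write */
	return readl_relaxed(bgmac->plat.base + offset);
}

static void bgmac_chip_intrs_off(struct bgmac *bgmac)
{
	bgmac_write(bgmac, BGMAC_INT_MASK, 0);
	bgmac_read_relaxed(bgmac, BGMAC_INT_MASK); /* flush posted write */
}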



- The higher-end networking SoCs are usually cache-coherent and
   can avoid the cache management entirely. There is a slim chance
   that this chip is designed that way and it just needs to be enabled
   properly. Most low-end chips don't implement the coherent
   interconnect though, and I suppose you have checked this already.


To the best of my knowledge the Northstar platform doesn't support hw
coherency.

I just took an extra look at Broadcom's SDK and they seem to have a
driver for selected chipsets, but BCM708 isn't there.

config BCM_GLB_COHERENCY
	bool "Global Hardware Cache Coherency"
	default n
	depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146 || BCM94912 || BCM96813 || BCM96756 || BCM96855



- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
   to have an extraneous dma_wmb(), which should be implied by the
   non-relaxed writel() in bgmac_write().


I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s
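
The variant I tested looks roughly like this (a sketch of
bgmac_dma_rx_update_index() with the barrier dropped; it assumes the
non-relaxed writel() inside bgmac_write() already orders the earlier
descriptor writes against the doorbell):

static void bgmac_dma_rx_update_index(struct bgmac *bgmac,
				      struct bgmac_dma_ring *ring)
{
	/* dma_wmb() dropped here: the writel() in bgmac_write() is
	 * assumed to order the prior descriptor writes before this
	 * MMIO doorbell write */
	bgmac_write(bgmac, ring->mmio_base + BGMAC_DMA_RX_INDEX,
		    ring->index_base +
		    ring->end * sizeof(struct bgmac_dma_desc));
}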


I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s



- accesses to the DMA descriptor don't show up in the profile here,
   but look like they can get misoptimized by the compiler. I would
   generally use READ_ONCE() and WRITE_ONCE() for these to
   ensure that you don't end up with extra or out-of-order accesses.
   This also makes it clearer to the reader that something special
   happens here.


Should I use something like below?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s


diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,
 
 	slot = &ring->slots[i];
 	dma_desc = &ring->cpu_base[i];
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
 }
 
 static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
 	 * B43_DMA64_DCTL1_ADDREXT_MASK;
 	 */
 
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
 
 	ring->end = desc_idx;
 }
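
For the read side, the complementary annotation could look like this (a
sketch against the rx header fields read in the RX poll loop; len and
flags as already declared there):

	/* The header is written by the device; READ_ONCE() documents
	 * that these loads must not be torn or re-issued */
	len = le16_to_cpu(READ_ONCE(rx->len));
	flags = le16_to_cpu(READ_ONCE(rx->flags));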



Re: Optimizing kernel compilation / alignments for network performance

2022-05-10 Thread Rafał Miłecki

On 6.05.2022 14:42, Andrew Lunn wrote:

I just took a quick look at the driver. It allocates and maps rx buffers
that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
This seems rather excessive, especially since most people are going to use
an MTU of 1500.
My proposal would be to add support for making the rx buffer size
dependent on MTU, reallocating the ring on MTU changes.
This should significantly reduce the time spent on flushing caches.


Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
configure MTU and add support for frames beyond 8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03

It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).

I do all my testing with
#define BGMAC_RX_MAX_FRAME_SIZE 1536


That helps show that cache operations are part of your bottleneck.

Taking a quick look at the driver. On the receive side:

	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here the data is mapped, ready for the CPU to use.

	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);

	/* Check for poison and drop or pass the packet */
	if (len == 0xdead && flags == 0xbeef) {
		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	if (len > BGMAC_RX_ALLOC_SIZE) {
		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_length_errors++;
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	/* Omit CRC. */
	len -= ETH_FCS_LEN;

	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
	if (unlikely(!skb)) {
		netdev_err(bgmac->net_dev, "build_skb failed\n");
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}
	skb_put(skb, BGMAC_RX_FRAME_OFFSET +
		BGMAC_RX_BUF_OFFSET + len);
	skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
		 BGMAC_RX_BUF_OFFSET);

	skb_checksum_none_assert(skb);
	skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the
cache actually work for you, rather than against you, by adding a call to

	prefetch(buf);

just after the dma_unmap_single(). That will start getting the frame
header from DRAM into cache, so hopefully it is available by the time
eth_type_trans() is called and you don't have a cache miss.



I don't think that analysis is correct.

Please take a look at the following lines:
	struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
	void *buf = slot->buf;

The first thing we do after the dma_unmap_single() call is read rx->len,
which already accesses the DMA'd data. There is nothing we could keep
the CPU busy with while prefetching.

FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed by
a single Mb/s. Speed was exactly the same as without the prefetch() call.



Re: Optimizing kernel compilation / alignments for network performance

2022-05-08 Thread Rafał Miłecki

On 6.05.2022 09:44, Rafał Miłecki wrote:

On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup


There are a lot of cache management functions here. Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.


It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels
more stable now (smaller variations). Let me spend some time on more
testing.


For context: I test various kernel commits / configs using:
iperf -t 120 -i 10 -c 192.168.13.1


I did more testing with "# CONFIG_SMP is not set".

Good thing:
During a single iperf session I get noticeably more stable speed.
With SMP: x ± 2.86%
Without SMP: x ± 0.96%

Bad thing:
Across kernel commits / config changes speed still varies.


So disabling CONFIG_SMP won't help me look for kernel regressions.



Re: Optimizing kernel compilation / alignments for network performance

2022-05-06 Thread Andrew Lunn
> > I just took a quick look at the driver. It allocates and maps rx buffers
> > that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> > This seems rather excessive, especially since most people are going to use
> > an MTU of 1500.
> > My proposal would be to add support for making the rx buffer size dependent
> > on MTU, reallocating the ring on MTU changes.
> > This should significantly reduce the time spent on flushing caches.
> 
> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> configure MTU and add support for frames beyond 8192 byte size"):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
> 
> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
> 
> I do all my testing with
> #define BGMAC_RX_MAX_FRAME_SIZE   1536

That helps show that cache operations are part of your bottleneck.

Taking a quick look at the driver. On the receive side:

	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here the data is mapped, ready for the CPU to use.

	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);

	/* Check for poison and drop or pass the packet */
	if (len == 0xdead && flags == 0xbeef) {
		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	if (len > BGMAC_RX_ALLOC_SIZE) {
		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_length_errors++;
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	/* Omit CRC. */
	len -= ETH_FCS_LEN;

	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
	if (unlikely(!skb)) {
		netdev_err(bgmac->net_dev, "build_skb failed\n");
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}
	skb_put(skb, BGMAC_RX_FRAME_OFFSET +
		BGMAC_RX_BUF_OFFSET + len);
	skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
		 BGMAC_RX_BUF_OFFSET);

	skb_checksum_none_assert(skb);
	skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the
cache actually work for you, rather than against you, by adding a call to

	prefetch(buf);

just after the dma_unmap_single(). That will start getting the frame
header from DRAM into cache, so hopefully it is available by the time
eth_type_trans() is called and you don't have a cache miss.

Andrew



Re: Optimizing kernel compilation / alignments for network performance

2022-05-06 Thread Rafał Miłecki

On 6.05.2022 10:45, Arnd Bergmann wrote:

On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki  wrote:


On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup


There are a lot of cache management functions here.


Indeed, so optimizing the coherency management (see Felix' reply)
is likely to help most in making the driver faster, but that does not
explain why the alignment of the object code has such a big impact
on performance.

To investigate the alignment further, what I was actually looking for
is a comparison of the profile of the slow and fast case. Here I would
expect that the slow case spends more time in one of the functions
that don't deal with cache management (maybe fib_table_lookup or
__netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation, maybe
   some of the calls can be turned into a relaxed read, like the readback
   in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(),
   though obviously not the one in bgmac_dma_rx_read().
   It may be possible to even avoid some of the reads entirely; checking
   for more data in bgmac_poll() may actually be counterproductive
   depending on the workload.

- The higher-end networking SoCs are usually cache-coherent and
   can avoid the cache management entirely. There is a slim chance
   that this chip is designed that way and it just needs to be enabled
   properly. Most low-end chips don't implement the coherent
   interconnect though, and I suppose you have checked this already.

- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
   to have an extraneous dma_wmb(), which should be implied by the
   non-relaxed writel() in bgmac_write().

- accesses to the DMA descriptor don't show up in the profile here,
   but look like they can get misoptimized by the compiler. I would
   generally use READ_ONCE() and WRITE_ONCE() for these to
   ensure that you don't end up with extra or out-of-order accesses.
   This also makes it clearer to the reader that something special
   happens here.


Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.


It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels
more stable now (smaller variations). Let me spend some time on more
testing.


FWIW during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
as that is what I need to get similar speeds across iperf sessions.

With
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 4 speeds:
273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
(every time I started iperf, the kernel jumped into one state and kept the
  same iperf speed until I stopped it and started another session)

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 2 speeds:
284 Mbps / 408 Mbps


Can you try using 'numactl -C' to pin the iperf processes to
a particular CPU core? This may be related to the locality of
the user process relative to where the interrupts end up.


I run iperf on x86 machines connected to the router's WAN and LAN ports.
It's meant to emulate an end user just downloading data from / uploading
data to the Internet.

The router's only task here is doing masquerade NAT.



Re: Optimizing kernel compilation / alignments for network performance

2022-05-06 Thread Rafał Miłecki

On 5.05.2022 18:46, Felix Fietkau wrote:


On 05.05.22 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup


There are a lot of cache management functions here. Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too
much. If you are sending a 64 byte TCP ACK, all you need to flush is
64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
recycle the buffer, all you need to invalidate is the size of the ACK,
so long as you can guarantee nothing has touched the memory above it.
But you need to be careful when implementing tricks like this, or you
can get subtle corruption bugs when you get it wrong.

I just took a quick look at the driver. It allocates and maps rx buffers
that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
This seems rather excessive, especially since most people are going to use
an MTU of 1500.
My proposal would be to add support for making the rx buffer size
dependent on MTU, reallocating the ring on MTU changes.
This should significantly reduce the time spent on flushing caches.


Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
configure MTU and add support for frames beyond 8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03

It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).

I do all my testing with
#define BGMAC_RX_MAX_FRAME_SIZE 1536
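
A rough sketch of the sizing half of Felix's proposal (the helper name
is made up; the harder part, reallocating the ring from .ndo_change_mtu,
is not shown):

static int bgmac_rx_buf_size(struct net_device *net_dev)
{
	/* Room for one frame at the current MTU plus the HW RX header,
	 * instead of the fixed 9724-byte worst case */
	unsigned int frame_size = ETH_HLEN + net_dev->mtu + ETH_FCS_LEN;

	return BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET + frame_size;
}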



Re: Optimizing kernel compilation / alignments for network performance

2022-05-06 Thread Rafał Miłecki

On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup


There are a lot of cache management functions here. Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.


It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels
more stable now (smaller variations). Let me spend some time on more
testing.


FWIW during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
as that is what I need to get similar speeds across iperf sessions.

With
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 4 speeds:
273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps
(every time I started iperf, the kernel jumped into one state and kept the
 same iperf speed until I stopped it and started another session)

With
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
my NAT speeds were jumping between 2 speeds:
284 Mbps / 408 Mbps



I've also found that some Ethernet drivers invalidate or flush too
much. If you are sending a 64 byte TCP ACK, all you need to flush is
64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
recycle the buffer, all you need to invalidate is the size of the ACK,
so long as you can guarantee nothing has touched the memory above it.
But you need to be careful when implementing tricks like this, or you
can get subtle corruption bugs when you get it wrong.


That was actually bgmac's initial behaviour, see commit 92b9ccd34a90
("bgmac: pass received packet to the netif instead of copying it"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92b9ccd34a9053c628d230fe27a7e0c10179910f

I think it was Felix who suggested that I avoid skb_copy*(), and it
seems it indeed improved performance.



Re: Optimizing kernel compilation / alignments for network performance

2022-05-05 Thread Felix Fietkau



On 05.05.22 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup


There are a lot of cache management functions here. Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too
much. If you are sending a 64 byte TCP ACK, all you need to flush is
64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
recycle the buffer, all you need to invalidate is the size of the ACK,
so long as you can guarantee nothing has touched the memory above it.
But you need to be careful when implementing tricks like this, or you
can get subtle corruption bugs when you get it wrong.
I just took a quick look at the driver. It allocates and maps rx buffers
that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
This seems rather excessive, especially since most people are going to
use an MTU of 1500.
My proposal would be to add support for making the rx buffer size
dependent on MTU, reallocating the ring on MTU changes.

This should significantly reduce the time spent on flushing caches.

- Felix



Re: Optimizing kernel compilation / alignments for network performance

2022-05-05 Thread Andrew Lunn
> you'll see that most used functions are:
> v7_dma_inv_range
> __irqentry_text_end
> l2c210_inv_range
> v7_dma_clean_range
> bcma_host_soc_read32
> __netif_receive_skb_core
> arch_cpu_idle
> l2c210_clean_range
> fib_table_lookup

There are a lot of cache management functions here. Might sound odd,
but have you tried disabling SMP? These cache functions need to
operate across all CPUs, and the communication between CPUs can slow
them down. If there is only one CPU, these cache functions get simpler
and faster.

It just depends on your workload. If you have 1 CPU loaded to 100% and
the other 3 idle, you might see an improvement. If you actually need
more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too
much. If you are sending a 64 byte TCP ACK, all you need to flush is
64 bytes, not the full 1500 MTU. If you receive a TCP ACK, and then
recycle the buffer, all you need to invalidate is the size of the ACK,
so long as you can guarantee nothing has touched the memory above it.
But you need to be careful when implementing tricks like this, or you
can get subtle corruption bugs when you get it wrong.
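
To illustrate the recycle case (a sketch only, not bgmac's current code;
dma_dev, dma_addr and len as used in the driver's RX path):

	/* Hand the buffer back to the device, syncing only the bytes
	 * the hardware actually wrote; this is only safe if the CPU is
	 * guaranteed not to have dirtied the rest of the buffer */
	dma_sync_single_for_device(dma_dev, dma_addr,
				   BGMAC_RX_FRAME_OFFSET +
				   BGMAC_RX_BUF_OFFSET + len,
				   DMA_FROM_DEVICE);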

Andrew



Re: Optimizing kernel compilation / alignments for network performance

2022-05-05 Thread Rafał Miłecki

On 29.04.2022 16:49, Arnd Bergmann wrote:

On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki  wrote:

On 27.04.2022 14:56, Alexander Lobakin wrote:



Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
manually.


1. Without ce5013ff3bec and with -falign-functions=32
387 Mb/s

2. Without ce5013ff3bec and with -falign-functions=64
377 Mb/s

3. With ce5013ff3bec and with -falign-functions=32
384 Mb/s

4. With ce5013ff3bec and with -falign-functions=64
377 Mb/s


So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable slightly lower speed


I'm going to perform tests on more commits but if it stays so reliable
as above that will be a huge success for me.


Note that the problem may not just be the alignment of a particular
function, but also how different functions map into your cache.
The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
64KB, with a line size of 32 bytes. If you are unlucky and five
frequently called functions are spaced exactly wrongly, so that they
need more than four ways, calling them in sequence would always evict
the other ones. The same could of course happen if the problem is the
D-cache or the L2.

Can you try to get a profile using 'perf record' to see where most
time is spent, in both the slowest and the fastest versions?
If the instruction cache is the issue, you should see how the hottest
addresses line up.


Your explanation sounds sane of course.

If you take a look at my old e-mail
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

Is there a way to optimize the kernel for optimal cache usage of the
selected (above) functions?


Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks
reported as worth trying. It's another source of randomness: it
stabilizes NAT performance across some commits and breaks stability
across others.



Re: Optimizing kernel compilation / alignments for network performance

2022-04-29 Thread Arnd Bergmann
On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki  wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:

> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
>
>
> 1. Without ce5013ff3bec and with -falign-functions=32
> 387 Mb/s
>
> 2. Without ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
> 3. With ce5013ff3bec and with -falign-functions=32
> 384 Mb/s
>
> 4. With ce5013ff3bec and with -falign-functions=64
> 377 Mb/s
>
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable slightly lower speed
>
>
> I'm going to perform tests on more commits but if it stays so reliable
> as above that will be a huge success for me.

Note that the problem may not just be the alignment of a particular
function, but also how different functions map into your cache.
The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
64KB, with a line size of 32 bytes. If you are unlucky and five
frequently called functions are spaced exactly wrongly, so that they
need more than four ways, calling them in sequence would always evict
the other ones. The same could of course happen if the problem is the
D-cache or the L2.
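
As a worked example for the 32KB, 4-way, 32-byte-line configuration: the
way size is 32KB / 4 = 8KB, so two code addresses compete for the same
set whenever they are congruent modulo 8KB. Five hot functions laid out
at multiples of 8KB from each other would need five ways in one set, one
more than the cache has, so each call in the sequence evicts one of the
other four.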

Can you try to get a profile using 'perf record' to see where most
time is spent, in both the slowest and the fastest versions?
If the instruction cache is the issue, you should see how the hottest
addresses line up.

Arnd



Re: Optimizing kernel compilation / alignments for network performance

2022-04-29 Thread Rafał Miłecki

On 27.04.2022 19:31, Rafał Miłecki wrote:

On 27.04.2022 14:56, Alexander Lobakin wrote:

From: Rafał Miłecki 
Date: Wed, 27 Apr 2022 14:04:54 +0200


I noticed years ago that kernel changes touching code - that I don't use
at all - can affect network performance for me.

I work with home routers based on the Broadcom Northstar platform. Those
are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. The main task
of those devices is NAT masquerade, and that is what I test with iperf
running on two x86 machines.

***

Example of such unused code change:
ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).

I first reported that issue in the e-mail thread:
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
unicast headers")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).

***

It appears Northstar CPUs have small caches, so any change in the
location of kernel symbols can affect NAT performance. That explains why
changing unrelated code affects anything; it has been partially proven
by aligning some of the cache-v7.S code.

My question is: is there a way to find out & force optimal symbol
locations?


Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
fighting with the same issue on some Realtek MIPS boards: random
code changes in random kernel core parts were affecting NAT /
network performance. This option resolved this I'd say, for the cost
of slightly increased vmlinux size (almost no change in vmlinuz
size).
The only thing is that it was recently restricted to a set of
architectures and MIPS and ARM32 are not included now lol. So it's
either a matter of expanding the list (since it was restricted only
because `-falign-functions=` is not supported on some architectures)
or you can just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size

The actual alignment is something to play with, I stopped on the
cacheline size, 32 in my case.
Also, this does not provide any guarantees that you won't suffer
from random data cacheline changes. There were some initiatives to
introduce debug alignment of data as well, but since functions are
often bigger than 32 bytes, while variables are usually much smaller,
it was increasing the vmlinux size by a ton (imagine each u32 variable
occupying 32-64 bytes instead of 4). But the chance of catching this
is much lower than that of suffering from I-cache function misplacement.


Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
manually.


1. Without ce5013ff3bec and with -falign-functions=32
387 Mb/s

2. Without ce5013ff3bec and with -falign-functions=64
377 Mb/s

3. With ce5013ff3bec and with -falign-functions=32
384 Mb/s

4. With ce5013ff3bec and with -falign-functions=64
377 Mb/s


So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable slightly lower speed


I'm going to perform tests on more commits but if it stays so reliable
as above that will be a huge success for me.


So sadly that doesn't work all the time. Or maybe it just works randomly.

I tried multiple commits with both -falign-functions=32 and
-falign-functions=64. I still get speed variations, about 30 Mb/s in
total. From commit to commit it's usually about 3%, but skipping a few
commits can result in up to 30 Mb/s (almost 10%).

Similarly to code changes, performance also gets affected by enabling /
disabling kernel config options. I noticed that enabling
CONFIG_CRYPTO_PCRYPT may decrease *or* increase speed depending on
-falign-functions (and surely depending on the kernel commit too).

┌──────────────────────┬───────────┬──────────┬───────┐
│                      │ no PCRYPT │ PCRYPT=y │ diff  │
├──────────────────────┼───────────┼──────────┼───────┤
│ No -falign-functions │ 363 Mb/s  │ 370 Mb/s │ +2%   │
│ -falign-functions=32 │ 364 Mb/s  │ 370 Mb/s │ +1.7% │
│ -falign-functions=64 │ 372 Mb/s  │ 365 Mb/s │ -2%   │
└──────────────────────┴───────────┴──────────┴───────┘

So I still don't have a reliable way of testing kernel changes for speed
regressions :(



Re: Optimizing kernel compilation / alignments for network performance

2022-04-27 Thread Rafał Miłecki

On 27.04.2022 14:56, Alexander Lobakin wrote:

From: Rafał Miłecki 
Date: Wed, 27 Apr 2022 14:04:54 +0200


I noticed years ago that kernel changes touching code - that I don't use
at all - can affect network performance for me.

I work with home routers based on the Broadcom Northstar platform. Those
are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. The main task
of those devices is NAT masquerade, and that is what I test with iperf
running on two x86 machines.

***

Example of such unused code change:
ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).

I first reported that issue in the e-mail thread:
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
unicast headers")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).

***

It appears Northstar CPUs have small caches, so any change in the
location of kernel symbols can affect NAT performance. That explains why
changing unrelated code affects anything; it has been partially proven
by aligning some of the cache-v7.S code.

My question is: is there a way to find out & force optimal symbol
locations?


Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
fighting with the same issue on some Realtek MIPS boards: random
code changes in random kernel core parts were affecting NAT /
network performance. This option resolved this I'd say, for the cost
of slightly increased vmlinux size (almost no change in vmlinuz
size).
The only thing is that it was recently restricted to a set of
architectures and MIPS and ARM32 are not included now lol. So it's
either a matter of expanding the list (since it was restricted only
because `-falign-functions=` is not supported on some architectures)
or you can just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size

The actual alignment is something to play with, I stopped on the
cacheline size, 32 in my case.
Also, this does not provide any guarantees that you won't suffer
from random data cacheline changes. There were some initiatives to
introduce debug alignment of data as well, but since functions are
often bigger than 32 bytes, while variables are usually much smaller,
it was increasing the vmlinux size by a ton (imagine each u32 variable
occupying 32-64 bytes instead of 4). But the chance of catching this
is much lower than that of suffering from I-cache function misplacement.


Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
manually.


1. Without ce5013ff3bec and with -falign-functions=32
387 Mb/s

2. Without ce5013ff3bec and with -falign-functions=64
377 Mb/s

3. With ce5013ff3bec and with -falign-functions=32
384 Mb/s

4. With ce5013ff3bec and with -falign-functions=64
377 Mb/s


So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable slightly lower speed


I'm going to perform tests on more commits but if it stays so reliable
as above that will be a huge success for me.



Re: Optimizing kernel compilation / alignments for network performance

2022-04-27 Thread Alexander Lobakin
From: Rafał Miłecki 
Date: Wed, 27 Apr 2022 14:04:54 +0200

> Hi,

Hej,

> 
> I noticed years ago that kernel changes touching code - that I don't use
> at all - can affect network performance for me.
> 
> I work with home routers based on the Broadcom Northstar platform. Those
> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. The main task
> of those devices is NAT masquerade, and that is what I test with iperf
> running on two x86 machines.
> 
> ***
> 
> Example of such unused code change:
> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).
> 
> I first reported that issue in the e-mail thread:
> ARM router NAT performance affected by random/unrelated commits
> https://lkml.org/lkml/2019/5/21/349
> https://www.spinics.net/lists/linux-block/msg40624.html
> 
> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
> unicast headers")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).
> 
> ***
> 
> It appears Northstar CPUs have small caches, so any change in the
> location of kernel symbols can affect NAT performance. That explains why
> changing unrelated code affects anything; it has been partially proven
> by aligning some of the cache-v7.S code.
> 
> My question is: is there a way to find out & force optimal symbol
> locations?

Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
fighting with the same issue on some Realtek MIPS boards: random
code changes in random kernel core parts were affecting NAT /
network performance. This option resolved this I'd say, for the cost
of slightly increased vmlinux size (almost no change in vmlinuz
size).
The only thing is that it was recently restricted to a set of
architectures and MIPS and ARM32 are not included now lol. So it's
either a matter of expanding the list (since it was restricted only
because `-falign-functions=` is not supported on some architectures)
or you can just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size

The actual alignment is something to play with, I stopped on the
cacheline size, 32 in my case.
Also, this does not provide any guarantees that you won't suffer
from random data cacheline changes. There were some initiatives to
introduce debug alignment of data as well, but since functions are
often bigger than 32 bytes, while variables are usually much smaller,
it was increasing the vmlinux size by a ton (imagine each u32 variable
occupying 32-64 bytes instead of 4). But the chance of catching this
is much lower than that of suffering from I-cache function misplacement.
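
For pinning individual functions rather than the whole build, the
kernel's __aligned() wrapper around the GCC attribute should do it (a
sketch; the function name is made up, pick the hot ones from a perf
profile):

/* Pad this one hot function's entry point to a 32-byte I-cache line,
 * independently of the global -falign-functions setting */
static void __aligned(32) my_hot_helper(void)
{
	/* ... */
}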

> 
> Adding .align 5 to cache-v7.S is a partial success. I'd like to find
> out what other functions are worth optimizing (aligning) and force that
> (I guess __attribute__((aligned(32))) could be used).
> 
> I can't really draw any conclusions from comparing System.map before and
> after the above commits as they relocate thousands of symbols in one go.
> 
> Optimizing is pretty important for me for two reasons:
> 1. I want to reach maximum possible NAT masquerade performance
> 2. I need stable performance across random commits to detect regressions

[0] 
https://elixir.bootlin.com/linux/v5.18-rc4/K/ident/CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Thanks,
Al
