Re: Optimizing kernel compilation / alignments for network performance
On 6.05.2022 11:44, Arnd Bergmann wrote: On Fri, May 6, 2022 at 10:55 AM Rafał Miłecki wrote: On 6.05.2022 10:45, Arnd Bergmann wrote: On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki wrote:

With echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus my NAT speeds were jumping between 2 speeds: 284 Mbps / 408 Mbps.

Can you try using 'numactl -C' to pin the iperf processes to a particular CPU core? This may be related to the locality of the user process relative to where the interrupts end up.

I run iperf on x86 machines connected to the router's WAN and LAN ports. It's meant to emulate an end user just downloading data from / uploading data to the Internet. The router's only task here is masquerade NAT.

Ah, makes sense. Can you observe the CPU usage to be on a particular core in the slow vs fast case then?

With echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus NAT speed was varying between:
a) 311 Mb/s (CPU loads: 100% + 0%)
b) 408 Mb/s (CPU loads: 100% + 62%)

With echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus NAT speed was varying between:
a) 290 Mb/s (CPU loads: 100% + 0%)
b) 410 Mb/s (CPU loads: 100% + 63%)

With echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus NAT speed was stable:
a) 372 Mb/s (CPU loads: 100% + 26%)
b) 375 Mb/s (CPU loads: 82% + 100%)

With echo 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus NAT speed was varying between:
a) 293 Mb/s (CPU loads: 100% + 0%)
b) 332 Mb/s (CPU loads: 100% + 17%)
c) 374 Mb/s (CPU loads: 81% + 100%)
d) 442 Mb/s (CPU loads: 100% + 75%)

After some extra debugging I found the reason for the varying CPU usage & varying NAT speeds. My router has a single switch, so I use two VLANs:
eth0.1 - LAN
eth0.2 - WAN
(VLAN traffic is routed to the correct ports by the switch.) On top of that I have a "br-lan" bridge interface bridging eth0.1 and the wireless interfaces. For all that time I had /sys/class/net/br-lan/queues/rx-0/rps_cpus set to 3, so bridge traffic was randomly handled by CPU 0 or CPU 1.
So if I assign a specific CPU core to each of the two interfaces, e.g.:
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 2 > /sys/class/net/br-lan/queues/rx-0/rps_cpus
things get stable. With the above I get a stable 419 Mb/s (CPU loads: 100% + 64%) on every iperf session.

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
Re: Optimizing kernel compilation / alignments for network performance
On 6.05.2022 10:45, Arnd Bergmann wrote: On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki wrote: On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Indeed, so optimizing the coherency management (see Felix' reply) is likely to help most in making the driver faster, but that does not explain why the alignment of the object code has such a big impact on performance.

To investigate the alignment further, what I was actually looking for is a comparison of the profiles of the slow and fast cases. Here I would expect that the slow case spends more time in one of the functions that don't deal with cache management (maybe fib_table_lookup or __netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation; maybe some of the calls can be turned into a relaxed read, like the readback in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(), though obviously not the one in bgmac_dma_rx_read(). It may be possible to even avoid some of the reads entirely; checking for more data in bgmac_poll() may actually be counterproductive depending on the workload.

I'll experiment with that, hopefully I can optimize it a bit.

- The higher-end networking SoCs are usually cache-coherent and can avoid the cache management entirely. There is a slim chance that this chip is designed that way and it just needs to be enabled properly. Most low-end chips don't implement the coherent interconnect though, and I suppose you have checked this already.

To my best knowledge the Northstar platform doesn't support hw coherency. I just took an extra look at Broadcom's SDK and they seem to have a driver for selected chipsets, but BCM708 isn't there:
config BCM_GLB_COHERENCY
	bool "Global Hardware Cache Coherency"
	default n
	depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146 || BCM94912 || BCM96813 || BCM96756 || BCM96855

- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear to have an extraneous dma_wmb(), which should be implied by the non-relaxed writel() in bgmac_write().

I tried dropping the wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s

I also tried dropping the bgmac_read() from bgmac_chip_intrs_off(), which seems to be a flushing readback.
With bgmac_read(): 421 Mb/s
Without: 413 Mb/s

- accesses to the DMA descriptor don't show up in the profile here, but look like they can get misoptimized by the compiler. I would generally use READ_ONCE() and WRITE_ONCE() for these to ensure that you don't end up with extra or out-of-order accesses. This also makes it clearer to the reader that something special happens here.

Should I use something like the below? FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s

diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,
 	slot = &ring->slots[i];
 	dma_desc = &ring->cpu_base[i];
 
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
 }
 
 static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
 	 * B43_DMA64_DCTL1_ADDREXT_MASK;
 	 */
 
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
 
 	ring->end = desc_idx;
 }
Re: Optimizing kernel compilation / alignments for network performance
On 6.05.2022 14:42, Andrew Lunn wrote:

I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. This seems rather excessive, especially since most people are going to use an MTU of 1500. My proposal would be to add support for making the rx buffer size dependent on the MTU, reallocating the ring on MTU changes. This should significantly reduce the time spent on flushing caches.

Oh, that's important too; it was changed by commit 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps). I do all my testing with:
#define BGMAC_RX_MAX_FRAME_SIZE 1536

That helps show that cache operations are part of your bottleneck. Taking a quick look at the driver, on the receive side:

	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here the data is unmapped, ready for the CPU to use.

	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);

	/* Check for poison and drop or pass the packet */
	if (len == 0xdead && flags == 0xbeef) {
		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	if (len > BGMAC_RX_ALLOC_SIZE) {
		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_length_errors++;
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	/* Omit CRC. */
	len -= ETH_FCS_LEN;

	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
	if (unlikely(!skb)) {
		netdev_err(bgmac->net_dev, "build_skb failed\n");
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	skb_put(skb, BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET + len);
	skb_pull(skb, BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET);

	skb_checksum_none_assert(skb);
	skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the cache actually work for you, rather than against you, by adding a call to prefetch(buf); just after the dma_unmap_single(). That will start getting the frame header from DRAM into cache, so hopefully it is available by the time eth_type_trans() is called and you don't have a cache miss.

I don't think that analysis is correct. Please take a look at the following lines:

	struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
	void *buf = slot->buf;

The first thing we do after the dma_unmap_single() call is read rx->len, which actually points at the DMA data. There is nothing we could keep the CPU busy with while prefetching the data.

FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed by a single Mb/s. Speed was exactly the same as without the prefetch() call.
Re: Optimizing kernel compilation / alignments for network performance
On 6.05.2022 09:44, Rafał Miłecki wrote: On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse.

It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels more stable now (lower variations). Let me spend some time on more testing.

For context: I test various kernel commits / configs using:
iperf -t 120 -i 10 -c 192.168.13.1

I did more testing with:
# CONFIG_SMP is not set

Good thing: during a single iperf session I get noticeably more stable speed.
With SMP: x ± 2.86%
Without SMP: x ± 0.96%

Bad thing: across kernel commits / config changes the speed still varies, so disabling CONFIG_SMP won't help me look for kernel regressions.
Re: Optimizing kernel compilation / alignments for network performance
> > I just took a quick look at the driver. It allocates and maps rx
> > buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> > This seems rather excessive, especially since most people are going to
> > use an MTU of 1500. My proposal would be to add support for making the
> > rx buffer size dependent on the MTU, reallocating the ring on MTU
> > changes. This should significantly reduce the time spent on flushing
> > caches.
>
> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> configure MTU and add support for frames beyond 8192 byte size"):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
>
> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
>
> I do all my testing with
> #define BGMAC_RX_MAX_FRAME_SIZE 1536

That helps show that cache operations are part of your bottleneck. Taking a quick look at the driver, on the receive side:

	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here the data is unmapped, ready for the CPU to use.

	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);

	/* Check for poison and drop or pass the packet */
	if (len == 0xdead && flags == 0xbeef) {
		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	if (len > BGMAC_RX_ALLOC_SIZE) {
		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
			   ring->start);
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_length_errors++;
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	/* Omit CRC. */
	len -= ETH_FCS_LEN;

	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
	if (unlikely(!skb)) {
		netdev_err(bgmac->net_dev, "build_skb failed\n");
		put_page(virt_to_head_page(buf));
		bgmac->net_dev->stats.rx_errors++;
		break;
	}

	skb_put(skb, BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET + len);
	skb_pull(skb, BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET);

	skb_checksum_none_assert(skb);
	skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the cache actually work for you, rather than against you, by adding a call to prefetch(buf); just after the dma_unmap_single(). That will start getting the frame header from DRAM into cache, so hopefully it is available by the time eth_type_trans() is called and you don't have a cache miss.

	Andrew
Re: Optimizing kernel compilation / alignments for network performance
On 6.05.2022 10:45, Arnd Bergmann wrote: On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki wrote: On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Indeed, so optimizing the coherency management (see Felix' reply) is likely to help most in making the driver faster, but that does not explain why the alignment of the object code has such a big impact on performance.

To investigate the alignment further, what I was actually looking for is a comparison of the profiles of the slow and fast cases. Here I would expect that the slow case spends more time in one of the functions that don't deal with cache management (maybe fib_table_lookup or __netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation; maybe some of the calls can be turned into a relaxed read, like the readback in bgmac_chip_intrs_off() or the 'poll again' at the end of bgmac_poll(), though obviously not the one in bgmac_dma_rx_read(). It may be possible to even avoid some of the reads entirely; checking for more data in bgmac_poll() may actually be counterproductive depending on the workload.

- The higher-end networking SoCs are usually cache-coherent and can avoid the cache management entirely. There is a slim chance that this chip is designed that way and it just needs to be enabled properly. Most low-end chips don't implement the coherent interconnect though, and I suppose you have checked this already.

- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear to have an extraneous dma_wmb(), which should be implied by the non-relaxed writel() in bgmac_write().

- accesses to the DMA descriptor don't show up in the profile here, but look like they can get misoptimized by the compiler. I would generally use READ_ONCE() and WRITE_ONCE() for these to ensure that you don't end up with extra or out-of-order accesses. This also makes it clearer to the reader that something special happens here.

Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse.

It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels more stable now (lower variations). Let me spend some time on more testing.

FWIW during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
as that is what I need to get similar speeds across iperf sessions.

With echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus my NAT speeds were jumping between 4 speeds: 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps (every time I started iperf the kernel jumped into one state and kept the same iperf speed until I stopped it and started another session).

With echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus my NAT speeds were jumping between 2 speeds: 284 Mbps / 408 Mbps.

Can you try using 'numactl -C' to pin the iperf processes to a particular CPU core? This may be related to the locality of the user process relative to where the interrupts end up.

I run iperf on x86 machines connected to the router's WAN and LAN ports. It's meant to emulate an end user just downloading data from / uploading data to the Internet. The router's only task here is masquerade NAT.
Re: Optimizing kernel compilation / alignments for network performance
On 5.05.2022 18:46, Felix Fietkau wrote: On 05.05.22 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too much. If you are sending a 64 byte TCP ACK, all you need to flush is 64 bytes, not the full 1500 byte MTU. If you receive a TCP ACK and then recycle the buffer, all you need to invalidate is the size of the ACK, as long as you can guarantee nothing has touched the memory above it. But you need to be careful when implementing tricks like this, or you can get subtle corruption bugs when you get it wrong.

I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. This seems rather excessive, especially since most people are going to use an MTU of 1500. My proposal would be to add support for making the rx buffer size dependent on the MTU, reallocating the ring on MTU changes. This should significantly reduce the time spent on flushing caches.

Oh, that's important too; it was changed by commit 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).

I do all my testing with:
#define BGMAC_RX_MAX_FRAME_SIZE 1536
Re: Optimizing kernel compilation / alignments for network performance
On 5.05.2022 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse.

It seems to lower my NAT speed from ~362 Mb/s to 320 Mb/s, but it feels more stable now (lower variations). Let me spend some time on more testing.

FWIW during all my tests I was using:
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
as that is what I need to get similar speeds across iperf sessions.

With echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus my NAT speeds were jumping between 4 speeds: 273 Mbps / 315 Mbps / 353 Mbps / 425 Mbps (every time I started iperf the kernel jumped into one state and kept the same iperf speed until I stopped it and started another session).

With echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus my NAT speeds were jumping between 2 speeds: 284 Mbps / 408 Mbps.

I've also found that some Ethernet drivers invalidate or flush too much. If you are sending a 64 byte TCP ACK, all you need to flush is 64 bytes, not the full 1500 byte MTU. If you receive a TCP ACK and then recycle the buffer, all you need to invalidate is the size of the ACK, as long as you can guarantee nothing has touched the memory above it. But you need to be careful when implementing tricks like this, or you can get subtle corruption bugs when you get it wrong.

That was actually bgmac's initial behaviour, see commit 92b9ccd34a90 ("bgmac: pass received packet to the netif instead of copying it"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92b9ccd34a9053c628d230fe27a7e0c10179910f
I think it was Felix who suggested avoiding skb_copy*(), and it seems it improved performance indeed.
Re: Optimizing kernel compilation / alignments for network performance
On 05.05.22 18:04, Andrew Lunn wrote:

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too much. If you are sending a 64 byte TCP ACK, all you need to flush is 64 bytes, not the full 1500 byte MTU. If you receive a TCP ACK and then recycle the buffer, all you need to invalidate is the size of the ACK, as long as you can guarantee nothing has touched the memory above it. But you need to be careful when implementing tricks like this, or you can get subtle corruption bugs when you get it wrong.

I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. This seems rather excessive, especially since most people are going to use an MTU of 1500. My proposal would be to add support for making the rx buffer size dependent on the MTU, reallocating the ring on MTU changes. This should significantly reduce the time spent on flushing caches.

- Felix
Re: Optimizing kernel compilation / alignments for network performance
> you'll see that most used functions are:
> v7_dma_inv_range
> __irqentry_text_end
> l2c210_inv_range
> v7_dma_clean_range
> bcma_host_soc_read32
> __netif_receive_skb_core
> arch_cpu_idle
> l2c210_clean_range
> fib_table_lookup

There is a lot of cache management functions here.

Might sound odd, but have you tried disabling SMP? These cache functions need to operate across all CPUs, and the communication between CPUs can slow them down. If there is only one CPU, these cache functions get simpler and faster. It just depends on your workload. If you have 1 CPU loaded to 100% and the other 3 idle, you might see an improvement. If you actually need more than one CPU, it will probably be worse.

I've also found that some Ethernet drivers invalidate or flush too much. If you are sending a 64 byte TCP ACK, all you need to flush is 64 bytes, not the full 1500 byte MTU. If you receive a TCP ACK, and then recycle the buffer, all you need to invalidate is the size of the ACK, so long as you can guarantee nothing has touched the memory above it. But you need to be careful when implementing tricks like this, or you can get subtle corruption bugs when you get it wrong.

	Andrew
Re: Optimizing kernel compilation / alignments for network performance
On 29.04.2022 16:49, Arnd Bergmann wrote: On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki wrote: On 27.04.2022 14:56, Alexander Lobakin wrote:

Thank you Alexander, this appears to be helpful! I decided to ignore CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS manually.

1. Without ce5013ff3bec and with -falign-functions=32: 387 Mb/s
2. Without ce5013ff3bec and with -falign-functions=64: 377 Mb/s
3. With ce5013ff3bec and with -falign-functions=32: 384 Mb/s
4. With ce5013ff3bec and with -falign-functions=64: 377 Mb/s

So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable, slightly lower speed

I'm going to perform tests on more commits, but if it stays as reliable as above that will be a huge success for me.

Note that the problem may not just be the alignment of a particular function, but also how different functions map into your cache. The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or 64KB, with a line size of 32 bytes. If you are unlucky and five frequently called functions are spaced at exactly the wrong distances, so that they map to the same set and need more than four ways, calling them in sequence would always evict the other ones. The same could of course happen if the problem is the D-cache or the L2.

Can you try to get a profile using 'perf record' to see where most time is spent, in both the slowest and the fastest versions? If the instruction cache is the issue, you should see how the hottest addresses line up.

Your explanation sounds sane, of course. If you take a look at my old e-mail "ARM router NAT performance affected by random/unrelated commits":
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html
you'll see that the most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

Is there a way to optimize the kernel for optimal cache usage of the selected (above) functions?

Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks reported as worth trying. It's another source of randomness: it stabilizes NAT performance across some commits and breaks stability across others.
Re: Optimizing kernel compilation / alignments for network performance
On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki wrote:
> On 27.04.2022 14:56, Alexander Lobakin wrote:
> Thank you Alexander, this appears to be helpful! I decided to ignore
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
> manually.
>
> 1. Without ce5013ff3bec and with -falign-functions=32: 387 Mb/s
> 2. Without ce5013ff3bec and with -falign-functions=64: 377 Mb/s
> 3. With ce5013ff3bec and with -falign-functions=32: 384 Mb/s
> 4. With ce5013ff3bec and with -falign-functions=64: 377 Mb/s
>
> So it seems that:
> 1. -falign-functions=32 = pretty stable high speed
> 2. -falign-functions=64 = very stable, slightly lower speed
>
> I'm going to perform tests on more commits, but if it stays as reliable
> as above that will be a huge success for me.

Note that the problem may not just be the alignment of a particular function, but also how different functions map into your cache. The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or 64KB, with a line size of 32 bytes. If you are unlucky and five frequently called functions are spaced at exactly the wrong distances, so that they map to the same set and need more than four ways, calling them in sequence would always evict the other ones. The same could of course happen if the problem is the D-cache or the L2.

Can you try to get a profile using 'perf record' to see where most time is spent, in both the slowest and the fastest versions? If the instruction cache is the issue, you should see how the hottest addresses line up.

	Arnd
Re: Optimizing kernel compilation / alignments for network performance
On 27.04.2022 19:31, Rafał Miłecki wrote: On 27.04.2022 14:56, Alexander Lobakin wrote: From: Rafał Miłecki Date: Wed, 27 Apr 2022 14:04:54 +0200

I noticed years ago that kernel changes touching code that I don't use at all can affect network performance for me. I work with home routers based on the Broadcom Northstar platform. Those are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. The main task of those devices is NAT masquerade, and that is what I test with iperf running on two x86 machines.

***

Example of such an unused code change: ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).

I first reported that issue in the e-mail thread "ARM router NAT performance affected by random/unrelated commits":
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html
Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv unicast headers"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).

***

It appears Northstar CPUs have small caches, so any change in the location of kernel symbols can affect NAT performance. That explains why changing unrelated code affects anything, and it has been partially proven by aligning some of the cache-v7.S code.

My question is: is there a way to find out & force optimal symbol locations?

Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been fighting with the same issue on some Realtek MIPS boards: random code changes in random kernel core parts were affecting NAT / network performance. This option resolved this, I'd say, at the cost of a slightly increased vmlinux size (almost no change in vmlinuz size).
The only thing is that it was recently restricted to a set of
architectures, and MIPS and ARM32 are not included now. So it's either a
matter of expanding the list (since it was restricted only because
`-falign-functions=` is not supported on some architectures) or you can
just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache line size

The actual alignment is something to play with; I stopped at the
cacheline size, 32 in my case.

Also, this does not guarantee that you won't suffer from random data
cacheline changes. There were some initiatives to introduce debug
alignment of data as well, but since functions are often bigger than 32
bytes while variables are usually much smaller, it was increasing the
vmlinux size by a ton (imagine each u32 variable occupying 32-64 bytes
instead of 4). Then again, the chance of being hit by data misplacement
is much lower than by I-cache function misplacement.

Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
manually.

1. Without ce5013ff3bec and with -falign-functions=32: 387 Mb/s
2. Without ce5013ff3bec and with -falign-functions=64: 377 Mb/s
3. With ce5013ff3bec and with -falign-functions=32: 384 Mb/s
4. With ce5013ff3bec and with -falign-functions=64: 377 Mb/s

So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable, slightly lower speed

I'm going to perform tests on more commits, but if it stays as reliable
as above, that will be a huge success for me.

So sadly that doesn't work all the time. Or maybe it just works
randomly. I tried multiple commits with both -falign-functions=32 and
-falign-functions=64, and I still get speed variations: about 30 Mb/s in
total. From commit to commit it's usually about 3%, but skipping a few
commits can result in a difference of up to 30 Mb/s (almost 10%).

Similarly to code changes, performance also gets affected by enabling /
disabling kernel config options.
I noticed that enabling CONFIG_CRYPTO_PCRYPT may decrease *or* increase
speed depending on -falign-functions (and surely depending on the kernel
commit too).

┌──────────────────────┬───────────┬──────────┬───────┐
│                      │ no PCRYPT │ PCRYPT=y │ diff  │
├──────────────────────┼───────────┼──────────┼───────┤
│ No -falign-functions │ 363 Mb/s  │ 370 Mb/s │ +2%   │
│ -falign-functions=32 │ 364 Mb/s  │ 370 Mb/s │ +1.7% │
│ -falign-functions=64 │ 372 Mb/s  │ 365 Mb/s │ -2%   │
└──────────────────────┴───────────┴──────────┴───────┘

So I still don't have a reliable way of testing kernel changes for speed
regressions :(

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
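Since the suspected culprit is I-cache placement, one cheap sanity check on a build is to measure how many function symbols in System.map actually start on a cache-line boundary. A rough sketch (the sample map below is made up for illustration; in practice you would point the awk script at the System.map of a real build):

```shell
# Count how many text symbols in System.map start on a 32-byte boundary.
# The sample map here is illustrative only; replace it with a real one.
cat > /tmp/System.map.sample <<'EOF'
c0008000 T _text
c0008020 t v7_flush_icache_all
c0008034 t v7_flush_dcache_louis
c0008040 T v7_flush_dcache_all
EOF
result=$(awk '$2 ~ /^[tT]$/ {
    total++
    # A 32-byte-aligned address ends in 00, 20, 40, 60, 80, a0, c0 or e0
    if ($1 ~ /(00|20|40|60|80|a0|c0|e0)$/) aligned++
} END { printf "%d/%d functions 32-byte aligned", aligned, total }' \
    /tmp/System.map.sample)
echo "$result"
```

With `-falign-functions=32` in effect the aligned fraction should be close to 100%; a low number would suggest the flag did not reach all objects.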
Re: Optimizing kernel compilation / alignments for network performance
From: Rafał Miłecki
Date: Wed, 27 Apr 2022 14:04:54 +0200

> Hi,

Hej,

> I noticed years ago that kernel changes touching code - that I don't use
> at all - can affect network performance for me.
>
> I work with home routers based on the Broadcom Northstar platform. Those
> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. The main task
> of those devices is NAT masquerade and that is what I test with iperf
> running on two x86 machines.
>
> ***
>
> Example of such an unused code change:
> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).
>
> I first reported that issue in the e-mail thread:
> ARM router NAT performance affected by random/unrelated commits
> https://lkml.org/lkml/2019/5/21/349
> https://www.spinics.net/lists/linux-block/msg40624.html
>
> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
> unicast headers")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).
>
> ***
>
> It appears Northstar CPUs have small caches, so any change in the
> location of kernel symbols can affect NAT performance. That explains why
> changing unrelated code affects anything, and it has been partially
> proven by aligning some of the cache-v7.S code.
>
> My question is: is there a way to find out & force optimal symbol
> locations?

Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B [0]. I've been
fighting the same issue on some Realtek MIPS boards: random code changes
in random kernel core parts were affecting NAT / network performance.
This option resolved it, I'd say, at the cost of a slightly increased
vmlinux size (almost no change in vmlinuz size).
The only thing is that it was recently restricted to a set of
architectures, and MIPS and ARM32 are not included now. So it's either a
matter of expanding the list (since it was restricted only because
`-falign-functions=` is not supported on some architectures) or you can
just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache line size

The actual alignment is something to play with; I stopped at the
cacheline size, 32 in my case.

Also, this does not guarantee that you won't suffer from random data
cacheline changes. There were some initiatives to introduce debug
alignment of data as well, but since functions are often bigger than 32
bytes while variables are usually much smaller, it was increasing the
vmlinux size by a ton (imagine each u32 variable occupying 32-64 bytes
instead of 4). Then again, the chance of being hit by data misplacement
is much lower than by I-cache function misplacement.

> Adding .align 5 to cache-v7.S is a partial success. I'd like to find
> out what other functions are worth optimizing (aligning) and force that
> (I guess __attribute__((aligned(32))) could be used).
>
> I can't really draw any conclusions from comparing System.map before
> and after the above commits, as they relocate thousands of symbols in
> one go.
>
> Optimizing is pretty important for me for two reasons:
> 1. I want to reach maximum possible NAT masquerade performance
> 2. I need stable performance across random commits to detect regressions

[0] https://elixir.bootlin.com/linux/v5.18-rc4/K/ident/CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Thanks,
Al
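The System.map comparison mentioned above can at least be automated to show how many symbols moved between two builds, even if it can't say which moves matter for the I-cache. A minimal sketch (the two sample maps are made up for illustration; in practice they would be System.map files saved from builds before and after a commit):

```shell
# Count text symbols whose address changed between two System.map files.
# The sample maps are illustrative; use maps saved from two real builds.
cat > /tmp/map_before <<'EOF'
c0008000 T _text
c0008040 t v7_flush_icache_all
c0008080 t v7_flush_dcache_all
EOF
cat > /tmp/map_after <<'EOF'
c0008000 T _text
c0008060 t v7_flush_icache_all
c0008080 t v7_flush_dcache_all
EOF
# First pass stores name -> address; second pass counts mismatches.
moved=$(awk 'NR == FNR { addr[$3] = $1; next }
             ($3 in addr) && addr[$3] != $1 { n++ }
             END { print n + 0 }' /tmp/map_before /tmp/map_after)
echo "$moved symbol(s) changed address"
```

Printing the mismatching names instead of a count would show *which* functions were relocated, which could help narrow down candidates for manual alignment.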