Re: Testing network / NAT performance

2022-07-03 Thread Rafał Miłecki

On 12.06.2022 21:58, Rafał Miłecki wrote:

6. Organizing kernel symbols

    CPUs of home routers usually have small caches. The way kernel
    symbols get organized during compilation may significantly affect
    network performance [3]. It's especially annoying as network
    unrelated changes may move / reorder symbols and affect cache hits &
    misses.

    There isn't a reliable solution for that. It may help to add
    -falign-functions=32 or -falign-functions=64 (depending on the
    platform) to the compiler flags, e.g. via KBUILD_CFLAGS.


I'll provide an example of a really annoying behaviour I've just
debugged. I noticed a NAT speed regression when switching from kernel
5.10 to 5.15. I narrowed it down to the 5.14 → 5.15 switch and then
started the bisecting process.

I found that the following commit:
4c00e1e2e58ee Merge tag 'linux-watchdog-5.15-rc1' of
git://www.linux-watchdog.org/linux-watchdog
dropped NAT speed from ~938 Mb/s down to ~907 Mb/s.

*

Here comes the interesting part: the regression isn't present at commit
41e73feb10249 ("dt-bindings: watchdog: Add compatible for Mediatek
MT7986") - the last commit in the merged branch (tag).

It means the merged code affects NAT performance only when applied on
top of the preceding commit 192ad3c27a489 ("Merge tag 'for-linus' of
git://git.kernel.org/pub/scm/virt/kvm/kvm").

I kept debugging and discovered that reverting dbe80cf471f94 ("watchdog:
Start watchdog in watchdog_set_last_hw_keepalive only if appropriate")
brings back high NAT speed.

*

Another interesting part: cherry-picking the above commit on top of
192ad3c27a489 ("Merge tag 'for-linus' of
git://git.kernel.org/pub/scm/virt/kvm/kvm") changes nothing (no NAT
regression). Further debugging revealed another commit is required to
trigger the regression: 60bcd91aafd22 ("watchdog: introduce
watchdog_dev_suspend/resume"). Cherry-picking both on top of the kvm
merge affects NAT performance.

*

Finally (even more fun):

1. Cherry-picking both commits on top of v5.14 does nothing (does not
   break NAT performance).

2. Reverting both commits from v5.15 doesn't fix the regression.

So the whole watchdog thing is just some kind of glitch. It makes
debugging an actual regression a really painful process. It breaks the
reliability of automated testing.

All of that happens with -falign-functions=32 and I'm not aware of any
workaround for such issues.

FWIW: the actual regression seems to be caused by one of the commits
introduced by 626bf91a292e2 ("Merge tag 'net-5.15-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net").

___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: Testing network / NAT performance

2022-06-17 Thread Ansuel Smith
On Fri, 17 Jun 2022 at 13:51, Hauke Mehrtens wrote:
>
> Hi Rafal,
>
> Thank you for your detailed analyses and also for the detailed report.
> This will be very helpful when I run into this problem.
>
> Can we somehow automate this so that we get notified about a
> performance regression a day after a bad change is committed, and not
> one year later?
>
> On 6/14/22 15:16, Rafał Miłecki wrote:
> > On 12.06.2022 21:58, Rafał Miłecki wrote:
> >> 5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")
> >>
> >> Improved network speed by 25% (256 Mb/s → 320 Mb/s).
> >>
> >> I didn't have time to bisect this *improvement* to a single kernel
> >> commit. I tried profiling but it isn't obvious to me what caused that
> >> improvement.
> >>
> >> Kernel 4.19:
> >>  11.94%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
> >>   7.06%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
> >>   3.37%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
> >>   2.80%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
> >>   2.67%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
> >>   2.63%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
> >>   2.43%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
> >>   2.13%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
> >>   1.82%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow
> >>   1.54%  ksoftirqd/0  [kernel.kallsyms]   [k] ip_forward
> >>   1.50%  ksoftirqd/0  [kernel.kallsyms]   [k] dma_cache_maint_page
> >>
> >> Kernel 5.4:
> >>  14.53%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
> >>   8.02%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
> >>   3.32%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
> >>   3.28%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
> >>   3.12%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
> >>   2.70%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
> >>   2.46%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
> >>   2.26%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
> >>   1.73%  ksoftirqd/0  [kernel.kallsyms]   [k] __dma_page_dev_to_cpu
> >>   1.72%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow
> >
> > Riddle solved. Change to bless/blame: 4e0c54bc5bc8 ("kernel: add support
> > for kernel 5.4").
> >
> > First of all bcm53xx uses
> > CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
> >
> >
> > OpenWrt's kernel Makefile in kernel 4.19:
> >
> > ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
> > KBUILD_CFLAGS += -Os $(EXTRA_OPTIMIZATION)
> > else
> > KBUILD_CFLAGS += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
> > endif
> >
> >
> > OpenWrt's kernel Makefile in 5.4:
> >
> > ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
> > KBUILD_CFLAGS += -O2 $(EXTRA_OPTIMIZATION)
> > else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
> > KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
> > else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
> > KBUILD_CFLAGS += -Os -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
> > endif
> >
> >
> > As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
> > from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.
>
> This looks like an accident to me.
> All targets except mediatek/mt7629 set
> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE in master. In OpenWrt 21.02 the
> ARCHS38 target set CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3, but now it is
> also set to normal performance.
>
> We should probably switch mediatek/mt7629 to
> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE, does anyone have such a device and
> could test a patch?
>
> > I noticed the problem with -fno-reorder-blocks a long time ago, see:
> > [PATCH RFC] kernel: drop -fno-reorder-blocks
> > https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/
> >
> >
> > It should really get sorted out...
>
> I would suggest removing the -fno-reorder-blocks and -fno-tree-ch
> options as they are not used.
>
>
> The next step could be Profile-guided optimization:
> https://lwn.net/Articles/830300/
> If the toolchain works properly I expect big improvements there, as
> routing, forwarding and NAT are done completely in the kernel and we
> use devices with small caches. Profile-guided optimization should be
> able to avoid many cache misses by packing the binary better.
>

PGO would be a dream to accomplish but it's a nightmare to actually use.
The kernel size grows a lot and it needs to be done correctly...
Also, AFAIK it's not that easy to add support for it, and it's
problematic for some devices to generate the profile data.

> Hauke
>

Re: Testing network / NAT performance

2022-06-17 Thread Hauke Mehrtens

Hi Rafal,

Thank you for your detailed analyses and also for the detailed report.
This will be very helpful when I run into this problem.

Can we somehow automate this so that we get notified about a performance
regression a day after a bad change is committed, and not one year
later?


On 6/14/22 15:16, Rafał Miłecki wrote:

On 12.06.2022 21:58, Rafał Miłecki wrote:

5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")

Improved network speed by 25% (256 Mb/s → 320 Mb/s).

I didn't have time to bisect this *improvement* to a single kernel
commit. I tried profiling but it isn't obvious to me what caused that
improvement.

Kernel 4.19:
 11.94%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
  7.06%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
  3.37%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
  2.80%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
  2.67%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
  2.63%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
  2.43%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
  2.13%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
  1.82%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow
  1.54%  ksoftirqd/0  [kernel.kallsyms]   [k] ip_forward
  1.50%  ksoftirqd/0  [kernel.kallsyms]   [k] dma_cache_maint_page

Kernel 5.4:
 14.53%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
  8.02%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
  3.32%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
  3.28%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
  3.12%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
  2.70%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
  2.46%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
  2.26%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
  1.73%  ksoftirqd/0  [kernel.kallsyms]   [k] __dma_page_dev_to_cpu
  1.72%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow


Riddle solved. Change to bless/blame: 4e0c54bc5bc8 ("kernel: add support
for kernel 5.4").

First of all bcm53xx uses
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y


OpenWrt's kernel Makefile in kernel 4.19:

ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os $(EXTRA_OPTIMIZATION)
else
KBUILD_CFLAGS += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
endif


OpenWrt's kernel Makefile in 5.4:

ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
KBUILD_CFLAGS += -O2 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
endif


As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.


This looks like an accident to me.
All targets except mediatek/mt7629 set
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE in master. In OpenWrt 21.02 the
ARCHS38 target set CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3, but now it is
also set to normal performance.


We should probably switch mediatek/mt7629 to 
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE, does anyone have such a device and 
could test a patch?



I noticed the problem with -fno-reorder-blocks a long time ago, see:
[PATCH RFC] kernel: drop -fno-reorder-blocks
https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/ 



It should really get sorted out...


I would suggest removing the -fno-reorder-blocks and -fno-tree-ch
options as they are not used.



The next step could be Profile-guided optimization:
https://lwn.net/Articles/830300/
If the toolchain works properly I expect big improvements there, as
routing, forwarding and NAT are done completely in the kernel and we use
devices with small caches. Profile-guided optimization should be able to
avoid many cache misses by packing the binary better.


Hauke



Fwd: Testing network / NAT performance

2022-06-14 Thread Rui Salvaterra
[Ugh, now with less HTML, sorry about that…]

Hi, Rafał,

On Tue, 14 Jun 2022 at 14:20, Rafał Miłecki  wrote:
>
> As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
> from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.
>
> I noticed the problem with -fno-reorder-blocks a long time ago, see:
> [PATCH RFC] kernel: drop -fno-reorder-blocks
> https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/
>
> It should really get sorted out...

Why not just drop both -fno-reorder-blocks and -fno-tree-ch? I have no
idea about the details, but those options seem to have been carried
forward from a time when GCC probably had issues with them (code
bloat, maybe). I've been carrying a patch in my tree for (about three)
years, dropping both options, with no issues at all across all the
architectures (ARM1176JZF-S, 24Kc, 74Kc, 1004Kc, Cortex-A9,
Cortex-A53, x86-64) and GCC versions (8, 9, 10, 11, 12) I've tested.

Cheers,
Rui



Re: Testing network / NAT performance

2022-06-14 Thread Rafał Miłecki

On 12.06.2022 21:58, Rafał Miłecki wrote:

5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")

Improved network speed by 25% (256 Mb/s → 320 Mb/s).

I didn't have time to bisect this *improvement* to a single kernel
commit. I tried profiling but it isn't obvious to me what caused that
improvement.

Kernel 4.19:
 11.94%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
  7.06%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
  3.37%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
  2.80%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
  2.67%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
  2.63%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
  2.43%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
  2.13%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
  1.82%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow
  1.54%  ksoftirqd/0  [kernel.kallsyms]   [k] ip_forward
  1.50%  ksoftirqd/0  [kernel.kallsyms]   [k] dma_cache_maint_page

Kernel 5.4:
 14.53%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
  8.02%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
  3.32%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
  3.28%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
  3.12%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
  2.70%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
  2.46%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
  2.26%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
  1.73%  ksoftirqd/0  [kernel.kallsyms]   [k] __dma_page_dev_to_cpu
  1.72%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow


Riddle solved. Change to bless/blame: 4e0c54bc5bc8 ("kernel: add support
for kernel 5.4").

First of all bcm53xx uses
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y


OpenWrt's kernel Makefile in kernel 4.19:

ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS   += -Os $(EXTRA_OPTIMIZATION)
else
KBUILD_CFLAGS   += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
endif


OpenWrt's kernel Makefile in 5.4:

ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
KBUILD_CFLAGS += -O2 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
endif


As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.

I noticed the problem with -fno-reorder-blocks a long time ago, see:
[PATCH RFC] kernel: drop -fno-reorder-blocks
https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/

It should really get sorted out...



Re: Testing network / NAT performance

2022-06-12 Thread Rafał Miłecki

Over the last years NAT performance on Northstar (bcm53xx) has changed
multiple times. No one keeps a close eye on this and Northstar testing
results also seem very unstable. During the last 2 months I probably
tested over a hundred OpenWrt commits going back to 2015.

I decided to do testing with -falign-functions=32 and at some point I
disabled CONFIG_SMP. I also did some tests without the rtcache patch,
which was dropped later anyway. Below I'm sharing my notes.


1. afafbc0d7454 ("kernel: bgmac: add more DMA related fixes")

This commit introduced varying speeds across testing sessions. It seems
that could be caused by the removal of dma_sync_single_for_cpu(), which
could make rps_cpus actually work as expected.


2. 39f115707531 ("bcm53xx: switch to kernel 4.4")

Kernel 4.2 introduced commit 66e5133f19e9 ("vlan: Add GRO support for
non hardware accelerated vlan") which lowered Northstar / bgmac
performance as it introduced csum_partial() calls in new code paths [1].

The regression can be worked around with:

ethtool -K eth0 gro off

(note: DSA requires disabling GRO also for switch ports)
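On a DSA setup the workaround has to be applied to every port; a minimal sketch, where the interface names (eth0, lan1..lan4, wan) are assumptions for a typical 5-port router and will differ per device:

```shell
#!/bin/sh
# Disable GRO on the CPU port and every DSA user port. The interface
# names below are placeholders; "ls /sys/class/net" shows the real ones.
for dev in eth0 lan1 lan2 lan3 lan4 wan; do
    ethtool -K "$dev" gro off
done
```

Verify with e.g. `ethtool -k lan1 | grep generic-receive-offload`.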


3. 916e33fa1e14 ("netifd: update to the latest version, rewrite RPS/XPS 
handling")

This changed the default rps_cpus and xps_cpus values. Its effect on
networking depends on the number of device CPUs and the setup.


4. 50c6938b95a0 ("bcm53xx: add v5.4 support")

This commit actually switched bcm53xx from kernel 4.14 to 4.19, which
somehow dropped network speed by 5%. It could be an actual net subsystem
change or just something unrelated. The difference is too small to make
full debugging worth it.


5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")

Improved network speed by 25% (256 Mb/s → 320 Mb/s).

I didn't have time to bisect this *improvement* to a single kernel
commit. I tried profiling but it isn't obvious to me what caused that
improvement.

Kernel 4.19:
11.94%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
 7.06%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
 3.37%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
 2.80%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
 2.67%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
 2.63%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
 2.43%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
 2.13%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
 1.82%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow
 1.54%  ksoftirqd/0  [kernel.kallsyms]   [k] ip_forward
 1.50%  ksoftirqd/0  [kernel.kallsyms]   [k] dma_cache_maint_page

Kernel 5.4:
14.53%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_inv_range
 8.02%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_inv_range
 3.32%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_poll
 3.28%  ksoftirqd/0  [kernel.kallsyms]   [k] v7_dma_clean_range
 3.12%  ksoftirqd/0  [kernel.kallsyms]   [k] __netif_receive_skb_core
 2.70%  ksoftirqd/0  [kernel.kallsyms]   [k] l2c210_clean_range
 2.46%  ksoftirqd/0  [kernel.kallsyms]   [k] __dev_queue_xmit
 2.26%  ksoftirqd/0  [kernel.kallsyms]   [k] bgmac_start_xmit
 1.73%  ksoftirqd/0  [kernel.kallsyms]   [k] __dma_page_dev_to_cpu
 1.72%  ksoftirqd/0  [kernel.kallsyms]   [k] nf_hook_slow


6. ba72ed537c4a ("kernel: backport GRO improvements")

Improved network speed by 10%.


7. 17576b1b2aea ("kernel: drop the conntrack rtcache patch")

Dropped network speed by 15%.


8. f55f1dbaad33 ("bcm53xx: switch to the kernel 5.10")

Kernel bump that introduced upstream commit 8c7da63978f1 ("bgmac:
configure MTU and add support for frames beyond 8192 byte size") which
dropped speed by 49%.


9. e9672b1a8fa4 ("bcm53xx: switch to the upstream DSA-based b53 driver")

At first it seemed like a 5% decrease in network performance. Profiling
revealed it was caused by an added csum_partial() call. Further
debugging showed it was tcp4_gro_receive() that started calling it.

Long story short: with DSA, GRO needs disabling on all switch
interfaces.

After some further testing it seems DSA actually bumped network speed
from 404 Mb/s to 445 Mb/s. From profiling it again isn't clear why.

swconfig:
13.46%  ksoftirqd/0  [kernel.kallsyms][k] v7_dma_inv_range
 7.39%  ksoftirqd/0  [kernel.kallsyms][k] l2c210_inv_range
 3.27%  ksoftirqd/0  [kernel.kallsyms][k] v7_dma_clean_range
 2.74%  ksoftirqd/0  [kernel.kallsyms][k] __netif_receive_skb_core.constprop.0
 2.72%  ksoftirqd/0  [kernel.kallsyms][k] l2c210_clean_range
 2.71%  ksoftirqd/0  [kernel.kallsyms][k] bgmac_poll
 2.56%  ksoftirqd/0  [kernel.kallsyms][k] bgmac_start_xmit
 2.31%  ksoftirqd/0  [kernel.kallsyms][k] fib_table_lookup
 1.91%  ksoftirqd/0  [kernel.kallsyms][k] 

Testing network / NAT performance

2022-06-12 Thread Rafał Miłecki

Over the years I saw multiple reports that a new OpenWrt release /
kernel update / netifd change / DSA introduction caused a regression in
router network / NAT speed (masquerade NAT in most cases). Most of those
reports remained unresolved, I believe.

The problem is that:
1. OpenWrt doesn't have automated testing environments
2. Developers can't figure anything from undetailed reports
3. Even experienced users don't know how to do proper debugging

I spent almost the last 2 months researching & testing masquerade NAT
performance. I thought I'd share my findings & results. Hopefully
this will get more people involved in tracing & fixing such
regressions.


*
* Testing method
*

In 99% of cases it's a totally bad idea to use online speed test
services, as they may be too unreliable. It's better to set up a local
server instead.

For actual testing you may use iperf or iperf3. If needed, FTP, HTTP or
another protocol may be an option too.


*
* Testing results
*

Network traffic is often not perfectly stable. To avoid getting false
results it may be worth it to:
1. Repeat the test in a few sessions
2. Reject the lowest & highest results
3. Calculate an average speed

Example of my testing:

for i in $(seq 1 5); do
    date
    iperf -t 80 -i 10 -c 192.168.99.1 | head -n -1 | sed -n 's/.* \([0-9][0-9]*\) Mbits\/sec.*/\1/p' | sort -n
    echo
    sleep 15
done

The above script lists 8 results from each iperf session. Later I take
the middle 4 and calculate an average from them. Then I calculate the
average of all 5 sessions. It may be overkill but it was meant to deal
with some really unstable cases.


*
* Environment setup
*

Get some (usually 2) PCs powerful enough to easily handle the maximum
expected router traffic. Once set up, avoid changing anything. A kernel
update or configuration change on a PC may affect results even if the
router is the bottleneck [1]. Disable power saving - I once noticed
lower performance whenever the screen saver got activated.

Connect a PC to the WAN port and set it up to use a static IP. You may
set up a DHCP server too, or just make OpenWrt use a static WAN IP &
gateway. Start an iperf / FTP / HTTP / whatever server.

Connect another PC to a LAN port and install a matching client for
generating network traffic.
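For reference, the WAN-side server PC setup described above might look like this; a sketch assuming the 192.168.99.x subnet used elsewhere in this mail and a placeholder interface name enp3s0:

```shell
#!/bin/sh
# Give the WAN-side PC a static address in the router's WAN subnet and
# start an iperf server. "enp3s0" is a placeholder interface name.
ip addr add 192.168.99.1/24 dev enp3s0
ip link set enp3s0 up
iperf -s    # iperf server, listening on TCP port 5001 by default
```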


*
* OpenWrt customizations
*

Depending on the setup you may need some custom configuration changes.
To avoid applying them manually on every boot, use uci-defaults scripts.

Example of my WAN setup:

mkdir -p files/etc/uci-defaults/

cat << EOF > files/etc/uci-defaults/90-nat.sh
#!/bin/sh
uci set network.wan.proto='static'
uci set network.wan.ipaddr='192.168.99.2'
uci set network.wan.netmask='255.255.255.0'
EOF


*
* Finding regressions
*

In continuous testing, pick an interval (daily testing or every n-th
commit) and look for regressions.

If you notice a regression, the first step is to find the first bad
commit. End users often assume that a regression was caused by a kernel
change as that is the simplest difference to notice. Always find the
exact commit.

Make sure to use git bisect [2] for finding first bad commits.


*
* Stabilizing performance
*

Probably the most annoying problem in debugging is unstable results.
Speed changing between testing sessions / reboots / recompilations makes
the whole testing unreliable and makes it hard to find a real
regression.

Below are a few tips that may help stabilize network speeds.

1. Repeat tests and get average

   Explained above.

2. Don't change environment setup

   Explained above.

3. Use pfifo qdisc

   It should be more stable for simple traffic (e.g. iperf generated).
   Include the "tc" package and execute something like:

   tc qdisc replace dev eth0 root pfifo

   Verify with:

   tc qdisc

4. Adjust rps_cpus and xps_cpus

   On multi-CPU devices, having multiple CPUs assigned to a single
   network device may result in traffic being assigned to a random CPU
   and in varying speeds across testing sessions.
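   The rps_cpus / xps_cpus sysfs files take a hex bitmask of CPUs; a minimal sketch pinning queue 0 of eth0 to a single CPU (the interface name and CPU index are assumptions):

```shell
#!/bin/sh
# Compute a hex CPU bitmask and pin RX/TX processing of eth0 queue 0
# to that one CPU, so packets stop bouncing between CPUs.
CPU=1
MASK=$(printf '%x' $((1 << CPU)))    # CPU 1 -> mask "2"
echo "$MASK" > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo "$MASK" > /sys/class/net/eth0/queues/tx-0/xps_cpus
```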

5. Disable CONFIG_SMP

   This will likely reduce performance but may help find a regression
   if testing results vary a lot.

6. Organizing kernel symbols

   CPUs of home routers usually have small caches. The way kernel
   symbols get organized during compilation may significantly affect
   network performance [3]. It's especially annoying as network
   unrelated changes may move / reorder symbols and affect cache hits &
   misses.

   There isn't a reliable solution for that. It may help to add
   -falign-functions=32 or -falign-functions=64 (depending on the
   platform) to the compiler flags, e.g. via KBUILD_CFLAGS.


*
* Profiling
*

Profiling with "perf" [4] allows checking what consumes CPUs. It's very
useful for finding