Re: RSS causing bad forwarding performance?
On Mon, 8 Dec 2025 at 06:24, Kajetan Staszkiewicz wrote:
>
> On 2025-12-08 00:55, Konstantin Belousov wrote:
> > It is somewhat strange that with/without RSS results differ for UDP.
> > The mlx5en driver always enables hashing the packet into an rx queue.
> > And, with a single UDP stream, I would expect all packets to hit the
> > same queue.
>
> With a single UDP stream and RSS disabled, the DUT gets 2 CPU cores
> loaded. One at 100%: I understand this is where the interrupts for
> incoming packets land, and it handles receiving, forwarding and sending
> the packet (with direct ISR dispatch). Another at around 15-20%: my best
> guess is that it's handling interrupts for confirmations of packets sent
> out through the other NIC.
>
> With a single UDP stream and RSS enabled, the DUT gets only 1 CPU core
> loaded. I understand that thanks to RSS the outbound queue on mce1 is
> the same as the inbound queue on mce0, and thus the same CPU core
> handles the irq for both queues.
>
> > As a consequence, with/without RSS should be the same (low).
>
> It is low without RSS, but with RSS it's not just low, it's terrible.
>
> > Could it be UDP which encapsulates some other traffic, e.g. a tunnel
> > that can be further classified by the internal headers, like the inner
> > headers of vxlan? Then more than one receive queue could be used.
>
> The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
> packets from port 1025 to port 12, filled with 0x78, length 10 B. My
> goal is to test packets-per-second performance, so I've chosen this
> test as it creates very short packets.
>
> > BTW, mce cards have huge numbers of supported offloads, but all of
> > them are host-oriented; they would not help for forwarding.
> > Again, an iperf stream would hit a single send/receive queue.
> > Parallel iperfs between the same machines scale.
>
> It seems that parallel streams forwarded through the machine scale too.
> It's a single stream that kills it, and only with option RSS enabled.
RSS was never really designed for optimising a single flow by having it
consume two CPU cores. It was designed for optimising a /whole lot of
flows/ by directing them to a consistent CPU mapping and, if used in
conjunction with CPU selection on the transmit side, avoiding cross-CPU
locking/synchronisation entirely. It doesn't help that the RSS defaults
(i.e. only one netisr, not hybrid mode IIRC, etc) are not the best for
lots of flows.

So in short, I think you're testing the wrong thing.

-adrian
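[Editor's note: the knobs Adrian alludes to can be adjusted roughly as
below. This is a sketch, not a tested recipe: the net.isr.* tunables are
standard, but the net.inet.rss.* tunables only exist on a kernel built with
`options RSS`, and the right bucket count depends on your core count, so
verify the names and values against your own system before relying on them.]

```shell
# /boot/loader.conf -- boot-time tunables (sketch; verify on your kernel)
net.isr.maxthreads=-1     # one netisr thread per CPU instead of the default 1
net.isr.bindthreads=1     # pin each netisr thread to its CPU
net.inet.rss.bits=3       # 2^3 = 8 RSS buckets; size to your core count

# Runtime, e.g. /etc/sysctl.conf
net.isr.dispatch=hybrid   # hybrid dispatch rather than pure direct/deferred
```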
Re: RSS causing bad forwarding performance?
On 2025-12-08 00:55, Konstantin Belousov wrote:
> It is somewhat strange that with/without RSS results differ for UDP.
> The mlx5en driver always enables hashing the packet into an rx queue.
> And, with a single UDP stream, I would expect all packets to hit the
> same queue.

With a single UDP stream and RSS disabled, the DUT gets 2 CPU cores
loaded. One at 100%: I understand this is where the interrupts for
incoming packets land, and it handles receiving, forwarding and sending
the packet (with direct ISR dispatch). Another at around 15-20%: my best
guess is that it's handling interrupts for confirmations of packets sent
out through the other NIC.

With a single UDP stream and RSS enabled, the DUT gets only 1 CPU core
loaded. I understand that thanks to RSS the outbound queue on mce1 is
the same as the inbound queue on mce0, and thus the same CPU core handles
the irq for both queues.

> As a consequence, with/without RSS should be the same (low).

It is low without RSS, but with RSS it's not just low, it's terrible.

> Could it be UDP which encapsulates some other traffic, e.g. a tunnel
> that can be further classified by the internal headers, like the inner
> headers of vxlan? Then more than one receive queue could be used.

The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
packets from port 1025 to port 12, filled with 0x78, length 10 B. My
goal is to test packets-per-second performance, so I've chosen this
test as it creates very short packets.

> BTW, mce cards have huge numbers of supported offloads, but all of them
> are host-oriented; they would not help for forwarding.
> Again, an iperf stream would hit a single send/receive queue.
> Parallel iperfs between the same machines scale.

It seems that parallel streams forwarded through the machine scale too.
It's a single stream that kills it, and only with option RSS enabled.
--
| pozdrawiam / regards | Powered by Debian and FreeBSD  |
| Kajetan Staszkiewicz | www: http://tuxpowered.net     |
|                      | matrix: @vegeta:tuxpowered.net |
`--^'
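[Editor's note: the single-queue behaviour described above follows directly
from how RSS selects a queue: the NIC runs a Toeplitz hash over the packet's
address/port tuple, so one UDP flow always lands in one bucket. A rough
Python sketch; the key and queue count here are made up for illustration and
are not the card's actual configuration.]

```python
import socket
import struct

# Placeholder 40-byte RSS key, for illustration only; real drivers load
# either a fixed well-known key or a randomly generated one into the NIC.
KEY = bytes(range(40))

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash: for every set bit of the input, XOR in the 32-bit
    window of the key that starts at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        byte, bit = divmod(i, 8)
        if data[byte] & (0x80 >> bit):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result

def rx_queue(src_ip, dst_ip, sport, dport, nqueues=4):
    """Map a UDP 4-tuple to a receive queue the way an RSS NIC would:
    hash src addr, dst addr, src port, dst port, then reduce to a bucket."""
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", sport, dport))
    return toeplitz_hash(KEY, data) % nqueues
```

A constant 4-tuple, like the single TRex stream from port 1025 to port 12,
always maps to the same bucket, while the randomised multi-stream test
spreads across all of them; this matches the observed scaling difference.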
Re: RSS causing bad forwarding performance?
On Sun, Dec 07, 2025 at 11:04:57PM +0100, Kajetan Staszkiewicz wrote:
> Hello Group,
>
> I'm using Cisco TRex to evaluate the forwarding performance of my FreeBSD
> routers. I wanted to establish a baseline of what FreeBSD 15 can forward
> without pf and complicated routing. The DUT is using a 6-core Intel
> E-2146G CPU with HT disabled, an Intel x520 for management and a Mellanox
> ConnectX-5 for forwarding. The mce interfaces use a separate fib, there
> are just a few static routes to make TRex work, and they are configured
> as they should be for a router: -lro -mediaopt rxpause,txpause. The tests
> have been performed without any NIC sysctl tuning.
>
> Testing is done with
> a single udp stream:
>   start -f stl/udp_1pkt_simple.py -m 50% --port 0
> multiple udp streams:
>   start -f stl/udp_1pkt_repeat_random.py -m 50% --port 0
>
> Links are at 25Gb/s, so at 50% TRex pushes around 18 Mpps to the DUT.
>
> NetISR is configured to make use of all CPU cores:
>   net.isr.bindthreads=1
>   net.isr.maxthreads=-1
>
> On the GENERIC kernel I'm getting:
>   dispatch=deferred single stream:     5.2 Mpps
>   dispatch=deferred multiple streams:  4.2 Mpps
>   dispatch=direct   single stream:     3.2 Mpps
>   dispatch=direct   multiple streams: 10.7 Mpps
>
> GENERIC + option RSS:
>   dispatch=deferred single stream:     0.4 Mpps
>   dispatch=deferred multiple streams: 11.0 Mpps
>   dispatch=direct   single stream:     0.4 Mpps
>   dispatch=direct   multiple streams: 11.0 Mpps
>
> GENERIC + option RSS + forwarding over Intel x520 NICs, just to be sure
> that it's not Mellanox's fault:
>   dispatch=deferred single stream:    between 1.9 and 0.1 Mpps
>   dispatch=deferred multiple streams:  4.5 Mpps
>   dispatch=direct   single stream:    between 1.9 and 0.1 Mpps
>   dispatch=direct   multiple streams:  4.5 Mpps
>
> As you can see, with option RSS and a single UDP stream the router
> totally clogs, dropping forwarding performance as low as 100 kpps.
> Without option RSS it works just fine.

It is somewhat strange that with/without RSS results differ for UDP.
The mlx5en driver always enables hashing the packet into an rx queue. And,
with a single UDP stream, I would expect all packets to hit the same queue.
As a consequence, with/without RSS should be the same (low).

Could it be UDP which encapsulates some other traffic, e.g. a tunnel that
can be further classified by the internal headers, like the inner headers
of vxlan? Then more than one receive queue could be used.

BTW, mce cards have huge numbers of supported offloads, but all of them
are host-oriented; they would not help for forwarding.

> Please note that this test is not about forwarding "real" traffic, like
> an iperf TCP stream, which would adjust the packet sending rate to the
> capacity of the DUT, but flooding it with more traffic than it can
> forward. Sadly the latter is often the case for devices exposed to the
> Internet.

Again, an iperf stream would hit a single send/receive queue. Parallel
iperfs between the same machines scale.
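[Editor's note: the tunnel remark is worth unpacking. VXLAN encapsulators
typically derive the outer UDP source port from a hash of the inner frame's
headers, as RFC 7348 recommends, so distinct inner flows get distinct outer
4-tuples and therefore land on different receive queues even though the
outer destination port is always 4789. A hypothetical sketch, with Python's
`hash()` standing in for whatever flow hash a real encapsulator uses:]

```python
def vxlan_outer_sport(inner_flow: tuple, low: int = 49152, high: int = 65535) -> int:
    """Pick the outer UDP source port for a VXLAN packet from a hash of
    the inner flow's headers, drawn from the ephemeral port range per the
    RFC 7348 recommendation. hash() is a stand-in for the real flow hash."""
    return low + (hash(inner_flow) % (high - low + 1))

# One inner flow keeps a stable outer source port (so its packets are not
# reordered), while many inner flows spread over the ephemeral range,
# giving the receiving NIC's RSS hash entropy to distribute across queues.
```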
