Re: RSS causing bad forwarding performance?

2025-12-08 Thread Adrian Chadd
On Mon, 8 Dec 2025 at 06:24, Kajetan Staszkiewicz  wrote:
>
> On 2025-12-08 00:55, Konstantin Belousov wrote:
>
> > It is somewhat strange that the results with/without RSS differ for UDP.
> > The mlx5en driver always enables hashing the packet into an rx queue, and
> > with a single UDP stream I would expect all packets to hit the same queue.
> With a single UDP stream and RSS disabled the DUT gets 2 CPU cores
> loaded. One is at 100%; I understand this is where the interrupts for
> incoming packets land, and it handles receiving, forwarding and sending
> the packet (with direct ISR dispatch). The other is at around 15-20%; my
> best guess is that it's handling interrupts for confirmations of packets
> sent out through the other NIC.
>
> With a single UDP stream and RSS enabled the DUT gets only 1 CPU core
> loaded. I understand that thanks to RSS the outbound queue on mce1 is
> the same as inbound queue on mce0 and thus the same CPU core handles irq
> for both queues.
>
> > As a consequence, with/without RSS should be the same (low).
>
> It is low for no RSS, but with RSS it's not just low, it's terrible.
>
> > Could it be that the UDP encapsulates some other traffic, e.g. a tunnel
> > that can be further classified by inner headers, like the inner headers
> > of VXLAN? Then more than one receive queue could be used.
>
> The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
> packets from port 1025 to port 12, filled with 0x78, length 10 B. My
> goal is to test packets-per-second performance, so I've chosen this
> test as it creates very short packets.
>
> > BTW, mce cards have a huge number of supported offloads, but all of them
> > are host-oriented; they would not help for forwarding.
>
> > Again, this is because an iperf stream would hit a single send/receive
> > queue. Parallel iperfs between the same machines scale.
>
> It seems that parallel streams forwarded through the machine scale too.
> It's a single stream that kills it, and only with option RSS enabled.

RSS was never really designed for optimising a single flow by having it consume
two CPU cores.

It was designed to optimise a /whole lot of flows/ by directing them to a
consistent CPU mapping and, when used in conjunction with CPU selection on
the transmit side, to avoid cross-CPU locking/synchronisation entirely.

It doesn't help that the RSS defaults (i.e. only one netisr, not hybrid
mode IIRC, etc.) are not the best for lots of flows.
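For reference, those defaults can be adjusted with the usual boot-time
tunables. A minimal sketch of a loader.conf fragment; the tunable names are
from memory and should be verified against your FreeBSD version (values
shown are illustrative, not recommendations):

```shell
# /boot/loader.conf -- boot-time tunables (sketch; verify names on your system)
net.isr.maxthreads="-1"    # one netisr thread per CPU instead of a single thread
net.isr.bindthreads="1"    # pin each netisr thread to its CPU
net.inet.rss.bits="2"      # log2 of RSS buckets; only meaningful with 'options RSS'

# The dispatch policy (direct / hybrid / deferred) is a runtime sysctl:
#   sysctl net.isr.dispatch=hybrid
```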

So in short, I think you're testing the wrong thing.



-adrian



Re: RSS causing bad forwarding performance?

2025-12-08 Thread Kajetan Staszkiewicz
On 2025-12-08 00:55, Konstantin Belousov wrote:

> It is somewhat strange that the results with/without RSS differ for UDP.
> The mlx5en driver always enables hashing the packet into an rx queue, and
> with a single UDP stream I would expect all packets to hit the same queue.
With a single UDP stream and RSS disabled the DUT gets 2 CPU cores
loaded. One is at 100%; I understand this is where the interrupts for
incoming packets land, and it handles receiving, forwarding and sending
the packet (with direct ISR dispatch). The other is at around 15-20%; my
best guess is that it's handling interrupts for confirmations of packets
sent out through the other NIC.

With a single UDP stream and RSS enabled the DUT gets only 1 CPU core
loaded. I understand that thanks to RSS the outbound queue on mce1 is
the same as inbound queue on mce0 and thus the same CPU core handles irq
for both queues.
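That queue alignment can be illustrated with a toy model of the RSS hash:
the Toeplitz hash over the 4-tuple picks a bucket, and a single flow always
lands in the same bucket, hence on the same CPU. A minimal Python sketch,
not the actual kernel code; the key is the well-known example key from the
Microsoft RSS documentation, and the addresses/bucket count are made up:

```python
# Toy model of RSS flow-to-bucket mapping (illustrative, not FreeBSD kernel code).
import socket
import struct

# Well-known example Toeplitz key from the Microsoft RSS documentation.
RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz_hash(data: bytes, key: bytes = RSS_KEY) -> int:
    """32-bit Toeplitz hash: for every set bit of the input, XOR in the
    32-bit window of the key that starts at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        byte, bit = divmod(i, 8)
        if data[byte] & (0x80 >> bit):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result

def rss_bucket(src_ip, dst_ip, sport, dport, nbuckets=128):
    """Map a UDP/TCP 4-tuple to an RSS bucket (bucket -> CPU is a second table)."""
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", sport, dport))
    return toeplitz_hash(data) % nbuckets

# A single flow always hashes to one bucket, so one CPU does all the work:
print(rss_bucket("10.0.0.1", "10.0.1.1", 1025, 12))
```

The point is only that the mapping is a pure function of the packet
headers: one flow, one bucket, one CPU, no matter how fast that flow sends.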

> As a consequence, with/without RSS should be the same (low).

It is low for no RSS, but with RSS it's not just low, it's terrible.

> Could it be that the UDP encapsulates some other traffic, e.g. a tunnel
> that can be further classified by inner headers, like the inner headers
> of VXLAN? Then more than one receive queue could be used.

The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
packets from port 1025 to port 12, filled with 0x78, length 10 B. My
goal is to test packets-per-second performance, so I've chosen this
test as it creates very short packets.
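A packet of that shape can be reproduced without TRex. A rough sketch using
plain struct (the function name is mine, not from the TRex script; the
checksum is left at zero, which UDP over IPv4 permits):

```python
import struct

def build_udp(sport: int, dport: int, payload: bytes) -> bytes:
    """Build a bare UDP datagram per RFC 768: 8-byte header + payload.
    Checksum is left at 0, which is allowed for UDP over IPv4."""
    length = 8 + len(payload)
    header = struct.pack(">HHHH", sport, dport, length, 0)
    return header + payload

# Approximation of what stl/udp_1pkt_simple.py sends: sport 1025, dport 12,
# payload filled with 0x78 (exact on-the-wire framing is the script's business).
pkt = build_udp(1025, 12, b"\x78" * 10)
print(len(pkt))  # 18
```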

> BTW, mce cards have a huge number of supported offloads, but all of them
> are host-oriented; they would not help for forwarding.

> Again, this is because an iperf stream would hit a single send/receive
> queue. Parallel iperfs between the same machines scale.

It seems that parallel streams forwarded through the machine scale too.
It's a single stream that kills it, and only with option RSS enabled.

-- 
| pozdrawiam / regards | Powered by Debian and FreeBSD  |
| Kajetan Staszkiewicz |   www: http://tuxpowered.net   |
|                      | matrix: @vegeta:tuxpowered.net |
`----------------------^--------------------------------'





Re: RSS causing bad forwarding performance?

2025-12-07 Thread Konstantin Belousov
On Sun, Dec 07, 2025 at 11:04:57PM +0100, Kajetan Staszkiewicz wrote:
> Hello Group,
> 
> I'm using Cisco TRex to evaluate the forwarding performance of my FreeBSD
> routers. I wanted to establish a baseline of what FreeBSD 15 can forward
> without pf and complicated routing. The DUT is using a 6-core Intel
> E-2146G CPU with disabled HT, Intel x520 for management and Mellanox
> ConnectX-5 for forwarding. The mce interfaces use a separate fib, there
> are just a few static routes to make TRex work, they are configured as
> they should be for a router: -lro -mediaopt rxpause,txpause. The tests
> have been performed without any NIC sysctl tuning.
> 
> Testing is done with
> single udp stream:
>   start -f stl/udp_1pkt_simple.py -m 50% --port 0
> multiple udp streams:
>   start -f stl/udp_1pkt_repeat_random.py -m 50% --port 0
> 
> Links are at 25Gb/s so at 50% TRex pushes around 18 Mpps to the DUT.
> 
> NetISR is configured to make use of all CPU cores:
> net.isr.bindthreads=1
> net.isr.maxthreads=-1
> 
> On the GENERIC kernel I'm getting:
> dispatch=deferred single stream:  5.2  Mpps
> dispatch=deferred multiple streams :  4.2  Mpps
> dispatch=direct   single stream:  3.2  Mpps
> dispatch=direct   multiple streams : 10.7  Mpps
> 
> GENERIC + option RSS:
> dispatch=deferred single stream:  0.4  Mpps
> dispatch=deferred multiple streams : 11.0  Mpps
> dispatch=direct   single stream:  0.4  Mpps
> dispatch=direct   multiple streams : 11.0  Mpps
> 
> GENERIC + option RSS + forwarding over Intel x520 NICs just to be sure
> that it's not Mellanox's fault:
> dispatch=deferred single stream:  between 1.9 and 0.1 Mpps
> dispatch=deferred multiple streams :  4.5  Mpps
> dispatch=direct   single stream:  between 1.9 and 0.1 Mpps
> dispatch=direct   multiple streams :  4.5  Mpps
> 
> As you can see, with option RSS and a single UDP stream the router
> totally clogs, dropping forwarding performance as low as 100 kpps.
> Without option RSS it works just fine.
It is somewhat strange that the results with/without RSS differ for UDP.
The mlx5en driver always enables hashing the packet into an rx queue, and
with a single UDP stream I would expect all packets to hit the same queue.
As a consequence, with/without RSS should be the same (low).

Could it be that the UDP encapsulates some other traffic, e.g. a tunnel
that can be further classified by inner headers, like the inner headers
of VXLAN? Then more than one receive queue could be used.

BTW, mce cards have a huge number of supported offloads, but all of them
are host-oriented; they would not help for forwarding.

> 
> Please note that this test is not about forwarding "real" traffic, like
> an iperf TCP stream, which would adjust the packet sending rate to the
> capacity of the DUT, but about flooding it with more traffic than it can
> forward. Sadly the latter is often the case for devices exposed to the
> Internet.

Again, this is because an iperf stream would hit a single send/receive
queue. Parallel iperfs between the same machines scale.