RE: NAT performance regression caused by vlan GRO support
From: Rafal Milecki
> Sent: 07 April 2019 12:55
...
> If not, maybe we really need to think about some good & clever condition for
> disabling GRO by default on hw without checksum offloading.

Maybe GRO could assume the checksums are valid, so that the checksum would
only be verified when the packet is delivered locally.
If the packet is forwarded then, provided the same packet boundaries are
used, the original checksums (possibly modified by NAT) can be reused.

No idea how easy this might be :-)

David
Re: NAT performance regression caused by vlan GRO support
Now I have some questions regarding possible optimizations. Note I'm not too
familiar with the net subsystem, so maybe I got some wrong ideas.

On 07.04.2019 13:53, Rafał Miłecki wrote:
> On 04.04.2019 14:57, Rafał Miłecki wrote:
>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
>> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
>> performance of my router dropped by 30% - 40%.
>
> I'll try to provide some summary for this issue. I'll focus on TCP traffic as
> that's what I happened to test.
>
> Basically all slowdowns are related to csum_partial(). Calculating checksums
> has a significant impact on NAT performance on devices with less powerful CPUs.
>
> ** GRO disabled
>
> Without GRO, csum_partial() is used only when validating TCP packets in
> nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).
>
> Simplified forward trace for that case:
> nf_conntrack_in
>     nf_conntrack_tcp_packet
>         tcp_error
>             if (state->net->ct.sysctl_checksum)
>                 nf_checksum
>                     nf_ip_checksum
>                         __skb_checksum_complete
>
> That validation can be disabled using the nf_conntrack_checksum sysctl, and it
> bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).
>
> ** GRO enabled
>
> First of all, GRO also includes TCP validation that requires calculating a
> checksum.
>
> Simplified forward trace for that case:
> vlan_gro_receive
>     call_gro_receive
>         inet_gro_receive
>             indirect_call_gro_receive
>                 tcp4_gro_receive
>                     skb_gro_checksum_validate
>                     tcp_gro_receive
>
> *If* we had a way to disable that validation, it *would* result in bumping NAT
> speed for me from 577 Mb/s to 825 Mb/s (+43%).

Could we have tcp4_gro_receive() behave similarly to tcp_error() and make it
respect the nf_conntrack_checksum sysctl value?

Could we simply add something like:
if (dev_net(skb->dev)->ct.sysctl_checksum)
to it (to additionally protect the skb_gro_checksum_validate() call)?

> Secondly, using GRO means we need to calculate a checksum before transmitting
> packets (applies to devices without HW checksum offloading). I think it's
> related to packet merging in skb_gro_receive() and then setting
> CHECKSUM_PARTIAL:
>
> vlan_gro_complete
>     inet_gro_complete
>         tcp4_gro_complete
>             tcp_gro_complete
>                 skb->ip_summed = CHECKSUM_PARTIAL;
>
> That results in bgmac calculating a checksum from scratch; take a look at
> bgmac_dma_tx_add(), which does:
>
> if (skb->ip_summed == CHECKSUM_PARTIAL)
>     skb_checksum_help(skb);
>
> Performing that whole checksum calculation will always result in GRO slowing
> down NAT for me when using the BCM47094 SoC with its not-so-powerful ARM CPUs.

Is it possible to avoid CHECKSUM_PARTIAL & skb_checksum_help(), which has to
calculate a whole checksum? It's definitely possible to *update* a checksum
after simple packet changes (e.g. amending an IP or port). Would it be
possible to use a similar method when dealing with packets with GRO enabled?

If not, maybe we really need to think about some good & clever condition for
disabling GRO by default on hw without checksum offloading.
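To make the "update instead of recompute" idea concrete, here is a minimal user-space sketch in Python (not kernel code) of the incremental one's-complement update described in RFC 1624. The helper names (fold16, internet_checksum, update_checksum, toy_segment) and the toy segment layout are assumptions made for illustration only; the real TCP checksum also covers a pseudo-header, and nothing here shows how the kernel does it.

```python
import struct


def fold16(total: int) -> int:
    """Fold carries back in until the value fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Full one's-complement checksum; cost grows with the data length."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    words = struct.unpack(f"!{len(data) // 2}H", data)
    return ~fold16(sum(words)) & 0xFFFF


def update_checksum(csum: int, old_word: int, new_word: int) -> int:
    """Incremental update for one changed 16-bit word (RFC 1624, eqn. 3)."""
    total = (~csum & 0xFFFF) + (~old_word & 0xFFFF) + (new_word & 0xFFFF)
    return ~fold16(total) & 0xFFFF


def toy_segment(src_ip: bytes, src_port: int, payload: bytes) -> bytes:
    """Hypothetical layout: 4-byte address, 2-byte port, then payload."""
    return src_ip + struct.pack("!H", src_port) + payload


payload = bytes(range(256)) * 4
old_ip, new_ip = bytes([192, 168, 1, 10]), bytes([203, 0, 113, 5])
old_port, new_port = 40000, 50000

old_segment = toy_segment(old_ip, old_port, payload)
new_segment = toy_segment(new_ip, new_port, payload)

# NAT-style rewrite: patch the checksum word by word instead of re-summing
# the whole payload.
patched = internet_checksum(old_segment)
old_words = struct.unpack("!2H", old_ip) + (old_port,)
new_words = struct.unpack("!2H", new_ip) + (new_port,)
for old_word, new_word in zip(old_words, new_words):
    patched = update_checksum(patched, old_word, new_word)

print(hex(patched), hex(internet_checksum(new_segment)))
```

The update touches three 16-bit words regardless of payload size, which is why it is cheap for NAT. It only works while the bytes covered by the checksum stay the same apart from the rewritten words; whether something equivalent is practical once GRO has merged several segments is the open question asked above.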
Re: NAT performance regression caused by vlan GRO support
On 04.04.2019 14:57, Rafał Miłecki wrote:
> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
> for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
> performance of my router dropped by 30% - 40%.

I'll try to provide some summary for this issue. I'll focus on TCP traffic as
that's what I happened to test.

Basically all slowdowns are related to csum_partial(). Calculating checksums
has a significant impact on NAT performance on devices with less powerful CPUs.

** GRO disabled

Without GRO, csum_partial() is used only when validating TCP packets in
nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).

Simplified forward trace for that case:
nf_conntrack_in
    nf_conntrack_tcp_packet
        tcp_error
            if (state->net->ct.sysctl_checksum)
                nf_checksum
                    nf_ip_checksum
                        __skb_checksum_complete

That validation can be disabled using the nf_conntrack_checksum sysctl, and it
bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).

** GRO enabled

First of all, GRO also includes TCP validation that requires calculating a
checksum.

Simplified forward trace for that case:
vlan_gro_receive
    call_gro_receive
        inet_gro_receive
            indirect_call_gro_receive
                tcp4_gro_receive
                    skb_gro_checksum_validate
                    tcp_gro_receive

*If* we had a way to disable that validation, it *would* result in bumping NAT
speed for me from 577 Mb/s to 825 Mb/s (+43%).

Secondly, using GRO means we need to calculate a checksum before transmitting
packets (applies to devices without HW checksum offloading). I think it's
related to packet merging in skb_gro_receive() and then setting
CHECKSUM_PARTIAL:

vlan_gro_complete
    inet_gro_complete
        tcp4_gro_complete
            tcp_gro_complete
                skb->ip_summed = CHECKSUM_PARTIAL;

That results in bgmac calculating a checksum from scratch; take a look at
bgmac_dma_tx_add(), which does:

if (skb->ip_summed == CHECKSUM_PARTIAL)
    skb_checksum_help(skb);

Performing that whole checksum calculation will always result in GRO slowing
down NAT for me when using the BCM47094 SoC with its not-so-powerful ARM CPUs.
Re: NAT performance regression caused by vlan GRO support
On 04/05/2019 03:51 AM, Florian Westphal wrote:
> Toke Høiland-Jørgensen wrote:
>> As a first approximation, maybe just:
>>
>> if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>>   disable_gro();
>
> I don't think it's a good idea. For local delivery case, there is no
> way to avoid the checksum cost, so might as well have GRO enabled.
>

We might add a sysctl or a way to tell the GRO layer:
do not attempt checksumming if forwarding is enabled on the host.

Basically, do GRO only if the NIC has provided checksum offload.
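The two policies discussed in this message can be compared side by side. The sketch below is only a model in Python: NetDev, gro_first_approximation and gro_forwarding_aware are hypothetical names, not kernel structures or APIs, and the forwarding-aware variant is one possible reading of the suggestion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetDev:
    """Hypothetical device description; not a kernel structure."""
    name: str
    has_hw_csum_offload: bool
    link_rate_mbps: int


def gro_first_approximation(dev: NetDev) -> bool:
    """First approximation: GRO off on links <= 1 Gbps without offload."""
    return dev.has_hw_csum_offload or dev.link_rate_mbps > 1000


def gro_forwarding_aware(dev: NetDev, forwarding: bool) -> bool:
    """Variant: without offload, keep GRO only when the host is not forwarding."""
    return dev.has_hw_csum_offload or not forwarding


router = NetDev("eth0", has_hw_csum_offload=False, link_rate_mbps=1000)
for forwarding in (False, True):
    print(forwarding,
          gro_first_approximation(router),
          gro_forwarding_aware(router, forwarding))
```

The first rule switches GRO off for this device even when traffic is delivered locally, which is the objection raised in the quoted reply; the second keeps GRO for local delivery and gives it up only on a forwarding host without checksum offload.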
Re: NAT performance regression caused by vlan GRO support
Toke Høiland-Jørgensen wrote:
> As a first approximation, maybe just:
>
> if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>   disable_gro();

I don't think it's a good idea. For the local delivery case, there is no
way to avoid the checksum cost, so we might as well have GRO enabled.
Re: NAT performance regression caused by vlan GRO support
Toshiaki Makita writes:

> On 2019/04/05 16:14, Felix Fietkau wrote:
>> On 2019-04-05 09:11, Rafał Miłecki wrote:
>>> On 05.04.2019 07:48, Rafał Miłecki wrote:
>>>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>>>> My test results:
>>>>>
>>>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>>>> Measured TCP throughput by netperf.
>>>>>
>>>>> GRO on : 17 Gbps
>>>>> GRO off: 5 Gbps
>>>>>
>>>>> So I failed to reproduce your problem.
>>>>
>>>> :( Thanks for trying & checking that!
>>>>
>>>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>>>> your machine?
>>>>
>>>> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
>>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>>
>>>> 16:33:40  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:33:50  all     0.00     0.00     0.00     0.00     0.00    58.79     0.00     0.00    41.21
>>>> 16:33:50    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
>>>> 16:33:50    1     0.00     0.00     0.00     0.00     0.00    17.58     0.00     0.00    82.42
>>>>
>>>> 16:33:50  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:34:00  all     0.00     0.00     0.05     0.00     0.00    59.44     0.00     0.00    40.51
>>>> 16:34:00    0     0.00     0.00     0.10     0.00     0.00    99.90     0.00     0.00     0.00
>>>> 16:34:00    1     0.00     0.00     0.00     0.00     0.00    18.98     0.00     0.00    81.02
>>>>
>>>> 16:34:00  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:34:10  all     0.00     0.00     0.00     0.00     0.00    59.59     0.00     0.00    40.41
>>>> 16:34:10    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
>>>> 16:34:10    1     0.00     0.00     0.00     0.00     0.00    19.18     0.00     0.00    80.82
>>>>
>>>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> Average:  all     0.00     0.00     0.02     0.00     0.00    59.27     0.00     0.00    40.71
>>>> Average:    0     0.00     0.00     0.03     0.00     0.00    99.97     0.00     0.00     0.00
>>>> Average:    1     0.00     0.00     0.00     0.00     0.00    18.58     0.00     0.00    81.42
>>>>
>>>> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
>>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>>
>>>> 16:34:39  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:34:49  all     0.00     0.00     0.05     0.00     0.00    86.91     0.00     0.00    13.04
>>>> 16:34:49    0     0.00     0.00     0.10     0.00     0.00    78.22     0.00     0.00    21.68
>>>> 16:34:49    1     0.00     0.00     0.00     0.00     0.00    95.60     0.00     0.00     4.40
>>>>
>>>> 16:34:49  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:34:59  all     0.00     0.00     0.10     0.00     0.00    87.06     0.00     0.00    12.84
>>>> 16:34:59    0     0.00     0.00     0.20     0.00     0.00    79.72     0.00     0.00    20.08
>>>> 16:34:59    1     0.00     0.00     0.00     0.00     0.00    94.41     0.00     0.00     5.59
>>>>
>>>> 16:34:59  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:35:09  all     0.00     0.00     0.05     0.00     0.00    85.71     0.00     0.00    14.24
>>>> 16:35:09    0     0.00     0.00     0.10     0.00     0.00    79.42     0.00     0.00    20.48
>>>> 16:35:09    1     0.00     0.00     0.00     0.00     0.00    92.01     0.00     0.00     7.99
>>>>
>>>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> Average:  all     0.00     0.00     0.07     0.00     0.00    86.56     0.00     0.00    13.37
>>>> Average:    0     0.00     0.00     0.13     0.00     0.00    79.12     0.00     0.00    20.75
>>>> Average:    1     0.00     0.00     0.00     0.00     0.00    94.01     0.00     0.00     5.99
>>>>
>>>> 3) System idle (no iperf)
>>>> root@OpenWrt:/# mpstat -P ALL 10 1
>>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>>
>>>> 16:35:31  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>>> 16:35:41  all     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>>>> 16:35:41    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>>>> 16:35:41    1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
Re: NAT performance regression caused by vlan GRO support
On 05.04.2019 10:12, Rafał Miłecki wrote:
> On 05.04.2019 09:58, Toshiaki Makita wrote:
>> On 2019/04/05 16:14, Felix Fietkau wrote:
>>> On 2019-04-05 09:11, Rafał Miłecki wrote:
>>>> I guess it's GRO + csum_partial() to be blamed for this performance drop.
>>>>
>>>> Maybe csum_partial() is very fast on your powerful machine and few extra calls
>>>> don't make a difference? I can imagine it affecting much slower home router with
>>>> ARM cores.
>>>
>>> Most high performance Ethernet devices implement hardware checksum
>>> offload, which completely gets rid of this overhead.
>>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
>>> why you're getting such crappy performance.
>>
>> Hmm... now I disabled rx checksum and tried the test again, and indeed I
>> see csum_partial from GRO path. But I also see csum_partial even without
>> GRO from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
>> Probably Rafał disabled nf_conntrack_checksum sysctl knob?
>>
>> But anyway even with disabling rx csum offload my machine has better
>> performance with GRO. I'm sure in some cases GRO should be disabled, but
>> I guess it's difficult to determine whether we should disable GRO or not
>> automatically when csum offload is not available.
>
> Few testing results:
>
> 1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [ 6] 0.0-60.0 sec 6.57 GBytes 940 Mbits/sec
>
> 2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [ 6] 0.0-60.0 sec 4.65 GBytes 666 Mbits/sec

For this case (GRO off and nf_conntrack_checksum enabled) I can confirm I see
csum_partial() in the perf output. It's taking 13,14% instead of 25,46% (as
when using GRO) though.

Samples: 38K of event 'cycles', Event count (approx.): 12209908413
  Overhead  Command      Shared Object      Symbol
+   13,14%  ksoftirqd/1  [kernel.kallsyms]  [k] csum_partial
+   10,16%  swapper      [kernel.kallsyms]  [k] v7_dma_inv_range
+    6,36%  swapper      [kernel.kallsyms]  [k] l2c210_inv_range
+    4,89%  swapper      [kernel.kallsyms]  [k] __irqentry_text_end
+    4,12%  ksoftirqd/1  [kernel.kallsyms]  [k] v7_dma_clean_range
+    3,78%  swapper      [kernel.kallsyms]  [k] bcma_host_soc_read32
+    2,76%  swapper      [kernel.kallsyms]  [k] arch_cpu_idle
+    2,45%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
+    2,37%  ksoftirqd/1  [kernel.kallsyms]  [k] l2c210_clean_range
+    1,76%  ksoftirqd/1  [kernel.kallsyms]  [k] bgmac_start_xmit
+    1,66%  swapper      [kernel.kallsyms]  [k] bgmac_poll
+    1,55%  ksoftirqd/1  [kernel.kallsyms]  [k] __dev_queue_xmit
+    1,11%  ksoftirqd/1  [kernel.kallsyms]  [k] skb_vlan_untag

> 3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [ 6] 0.0-60.0 sec 4.02 GBytes 575 Mbits/sec
>
> 4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
> [ 6] 0.0-60.0 sec 4.04 GBytes 579 Mbits/sec
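For a quick comparison of the four runs, the relative differences can be worked out with a few lines of Python. This is only arithmetic on the throughput figures quoted in this message; the dictionary keys and the percent_change helper are names chosen for this sketch.

```python
# Throughput in Mbit/s, copied from the four iperf runs quoted in this message.
results = {
    ("gro off", "conntrack csum off"): 940,
    ("gro off", "conntrack csum on"): 666,
    ("gro on", "conntrack csum off"): 575,
    ("gro on", "conntrack csum on"): 579,
}


def percent_change(new: float, old: float) -> float:
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


best = max(results.values())
for (gro, csum), mbps in results.items():
    delta = percent_change(mbps, best)
    print(f"{gro:8s} {csum:20s} {mbps:4d} Mbit/s  {delta:+6.1f}% vs best")
```

With GRO on, the two conntrack settings land within a few Mbit/s of each other, while with GRO off the conntrack checksum validation alone accounts for the gap between 666 and 940 Mbit/s.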
Re: NAT performance regression caused by vlan GRO support
On 05.04.2019 09:58, Toshiaki Makita wrote:
> On 2019/04/05 16:14, Felix Fietkau wrote:
>> On 2019-04-05 09:11, Rafał Miłecki wrote:
>>> I guess it's GRO + csum_partial() to be blamed for this performance drop.
>>>
>>> Maybe csum_partial() is very fast on your powerful machine and few extra calls
>>> don't make a difference? I can imagine it affecting much slower home router with
>>> ARM cores.
>>
>> Most high performance Ethernet devices implement hardware checksum
>> offload, which completely gets rid of this overhead.
>> Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
>> why you're getting such crappy performance.
>
> Hmm... now I disabled rx checksum and tried the test again, and indeed I
> see csum_partial from GRO path. But I also see csum_partial even without
> GRO from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
> Probably Rafał disabled nf_conntrack_checksum sysctl knob?
>
> But anyway even with disabling rx csum offload my machine has better
> performance with GRO. I'm sure in some cases GRO should be disabled, but
> I guess it's difficult to determine whether we should disable GRO or not
> automatically when csum offload is not available.

Few testing results:

1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[ 6] 0.0-60.0 sec 6.57 GBytes 940 Mbits/sec

2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[ 6] 0.0-60.0 sec 4.65 GBytes 666 Mbits/sec

3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[ 6] 0.0-60.0 sec 4.02 GBytes 575 Mbits/sec

4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[ 6] 0.0-60.0 sec 4.04 GBytes 579 Mbits/sec
Re: NAT performance regression caused by vlan GRO support
On 2019/04/05 16:14, Felix Fietkau wrote:
> On 2019-04-05 09:11, Rafał Miłecki wrote:
>> On 05.04.2019 07:48, Rafał Miłecki wrote:
>>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>>> My test results:
>>>>
>>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>>> Measured TCP throughput by netperf.
>>>>
>>>> GRO on : 17 Gbps
>>>> GRO off: 5 Gbps
>>>>
>>>> So I failed to reproduce your problem.
>>>
>>> :( Thanks for trying & checking that!
>>>
>>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>>> your machine?
>>>
>>> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>
>>> 16:33:40  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:33:50  all     0.00     0.00     0.00     0.00     0.00    58.79     0.00     0.00    41.21
>>> 16:33:50    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
>>> 16:33:50    1     0.00     0.00     0.00     0.00     0.00    17.58     0.00     0.00    82.42
>>>
>>> 16:33:50  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:34:00  all     0.00     0.00     0.05     0.00     0.00    59.44     0.00     0.00    40.51
>>> 16:34:00    0     0.00     0.00     0.10     0.00     0.00    99.90     0.00     0.00     0.00
>>> 16:34:00    1     0.00     0.00     0.00     0.00     0.00    18.98     0.00     0.00    81.02
>>>
>>> 16:34:00  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:34:10  all     0.00     0.00     0.00     0.00     0.00    59.59     0.00     0.00    40.41
>>> 16:34:10    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
>>> 16:34:10    1     0.00     0.00     0.00     0.00     0.00    19.18     0.00     0.00    80.82
>>>
>>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> Average:  all     0.00     0.00     0.02     0.00     0.00    59.27     0.00     0.00    40.71
>>> Average:    0     0.00     0.00     0.03     0.00     0.00    99.97     0.00     0.00     0.00
>>> Average:    1     0.00     0.00     0.00     0.00     0.00    18.58     0.00     0.00    81.42
>>>
>>> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
>>> root@OpenWrt:/# mpstat -P ALL 10 3
>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>
>>> 16:34:39  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:34:49  all     0.00     0.00     0.05     0.00     0.00    86.91     0.00     0.00    13.04
>>> 16:34:49    0     0.00     0.00     0.10     0.00     0.00    78.22     0.00     0.00    21.68
>>> 16:34:49    1     0.00     0.00     0.00     0.00     0.00    95.60     0.00     0.00     4.40
>>>
>>> 16:34:49  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:34:59  all     0.00     0.00     0.10     0.00     0.00    87.06     0.00     0.00    12.84
>>> 16:34:59    0     0.00     0.00     0.20     0.00     0.00    79.72     0.00     0.00    20.08
>>> 16:34:59    1     0.00     0.00     0.00     0.00     0.00    94.41     0.00     0.00     5.59
>>>
>>> 16:34:59  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:35:09  all     0.00     0.00     0.05     0.00     0.00    85.71     0.00     0.00    14.24
>>> 16:35:09    0     0.00     0.00     0.10     0.00     0.00    79.42     0.00     0.00    20.48
>>> 16:35:09    1     0.00     0.00     0.00     0.00     0.00    92.01     0.00     0.00     7.99
>>>
>>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> Average:  all     0.00     0.00     0.07     0.00     0.00    86.56     0.00     0.00    13.37
>>> Average:    0     0.00     0.00     0.13     0.00     0.00    79.12     0.00     0.00    20.75
>>> Average:    1     0.00     0.00     0.00     0.00     0.00    94.01     0.00     0.00     5.99
>>>
>>> 3) System idle (no iperf)
>>> root@OpenWrt:/# mpstat -P ALL 10 1
>>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>>
>>> 16:35:31  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> 16:35:41  all     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>>> 16:35:41    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>>> 16:35:41    1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>>>
>>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>>> Average:  all     0.00     0.00     0.0
Re: NAT performance regression caused by vlan GRO support
On 05.04.2019 07:48, Rafał Miłecki wrote:
> On 05.04.2019 06:26, Toshiaki Makita wrote:
>> My test results:
>>
>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>> Measured TCP throughput by netperf.
>>
>> GRO on : 17 Gbps
>> GRO off: 5 Gbps
>>
>> So I failed to reproduce your problem.
>
> :( Thanks for trying & checking that!
>
>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>> your machine?
>
> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
> root@OpenWrt:/# mpstat -P ALL 10 3
> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>
> 16:33:40  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:33:50  all     0.00     0.00     0.00     0.00     0.00    58.79     0.00     0.00    41.21
> 16:33:50    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
> 16:33:50    1     0.00     0.00     0.00     0.00     0.00    17.58     0.00     0.00    82.42
>
> 16:33:50  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:34:00  all     0.00     0.00     0.05     0.00     0.00    59.44     0.00     0.00    40.51
> 16:34:00    0     0.00     0.00     0.10     0.00     0.00    99.90     0.00     0.00     0.00
> 16:34:00    1     0.00     0.00     0.00     0.00     0.00    18.98     0.00     0.00    81.02
>
> 16:34:00  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:34:10  all     0.00     0.00     0.00     0.00     0.00    59.59     0.00     0.00    40.41
> 16:34:10    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
> 16:34:10    1     0.00     0.00     0.00     0.00     0.00    19.18     0.00     0.00    80.82
>
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.00     0.00     0.02     0.00     0.00    59.27     0.00     0.00    40.71
> Average:    0     0.00     0.00     0.03     0.00     0.00    99.97     0.00     0.00     0.00
> Average:    1     0.00     0.00     0.00     0.00     0.00    18.58     0.00     0.00    81.42
>
> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
> root@OpenWrt:/# mpstat -P ALL 10 3
> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>
> 16:34:39  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:34:49  all     0.00     0.00     0.05     0.00     0.00    86.91     0.00     0.00    13.04
> 16:34:49    0     0.00     0.00     0.10     0.00     0.00    78.22     0.00     0.00    21.68
> 16:34:49    1     0.00     0.00     0.00     0.00     0.00    95.60     0.00     0.00     4.40
>
> 16:34:49  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:34:59  all     0.00     0.00     0.10     0.00     0.00    87.06     0.00     0.00    12.84
> 16:34:59    0     0.00     0.00     0.20     0.00     0.00    79.72     0.00     0.00    20.08
> 16:34:59    1     0.00     0.00     0.00     0.00     0.00    94.41     0.00     0.00     5.59
>
> 16:34:59  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:35:09  all     0.00     0.00     0.05     0.00     0.00    85.71     0.00     0.00    14.24
> 16:35:09    0     0.00     0.00     0.10     0.00     0.00    79.42     0.00     0.00    20.48
> 16:35:09    1     0.00     0.00     0.00     0.00     0.00    92.01     0.00     0.00     7.99
>
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.00     0.00     0.07     0.00     0.00    86.56     0.00     0.00    13.37
> Average:    0     0.00     0.00     0.13     0.00     0.00    79.12     0.00     0.00    20.75
> Average:    1     0.00     0.00     0.00     0.00     0.00    94.01     0.00     0.00     5.99
>
> 3) System idle (no iperf)
> root@OpenWrt:/# mpstat -P ALL 10 1
> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>
> 16:35:31  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> 16:35:41  all     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
> 16:35:41    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
> 16:35:41    1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
> Average:    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
> Average:    1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>
>> If CPU is 100%, perf may help us analyze your problem. If it's available,
>> try running below while testing:
>> # perf record -a -g -- sleep 5
>>
>> And then run this after testing:
>> # perf report --no-child

I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.

I guess it's GRO + csum_partial() to be
Re: NAT performance regression caused by vlan GRO support
On 2019-04-05 09:11, Rafał Miłecki wrote:
> On 05.04.2019 07:48, Rafał Miłecki wrote:
>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>> My test results:
>>>
>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>> Measured TCP throughput by netperf.
>>>
>>> GRO on : 17 Gbps
>>> GRO off: 5 Gbps
>>>
>>> So I failed to reproduce your problem.
>>
>> :( Thanks for trying & checking that!
>>
>>
>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>> your machine?
>>
>> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
>> root@OpenWrt:/# mpstat -P ALL 10 3
>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>
>> 16:33:40  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:33:50  all     0.00     0.00     0.00     0.00     0.00    58.79     0.00     0.00    41.21
>> 16:33:50    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
>> 16:33:50    1     0.00     0.00     0.00     0.00     0.00    17.58     0.00     0.00    82.42
>>
>> 16:33:50  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:34:00  all     0.00     0.00     0.05     0.00     0.00    59.44     0.00     0.00    40.51
>> 16:34:00    0     0.00     0.00     0.10     0.00     0.00    99.90     0.00     0.00     0.00
>> 16:34:00    1     0.00     0.00     0.00     0.00     0.00    18.98     0.00     0.00    81.02
>>
>> 16:34:00  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:34:10  all     0.00     0.00     0.00     0.00     0.00    59.59     0.00     0.00    40.41
>> 16:34:10    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
>> 16:34:10    1     0.00     0.00     0.00     0.00     0.00    19.18     0.00     0.00    80.82
>>
>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> Average:  all     0.00     0.00     0.02     0.00     0.00    59.27     0.00     0.00    40.71
>> Average:    0     0.00     0.00     0.03     0.00     0.00    99.97     0.00     0.00     0.00
>> Average:    1     0.00     0.00     0.00     0.00     0.00    18.58     0.00     0.00    81.42
>>
>>
>> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
>> root@OpenWrt:/# mpstat -P ALL 10 3
>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>
>> 16:34:39  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:34:49  all     0.00     0.00     0.05     0.00     0.00    86.91     0.00     0.00    13.04
>> 16:34:49    0     0.00     0.00     0.10     0.00     0.00    78.22     0.00     0.00    21.68
>> 16:34:49    1     0.00     0.00     0.00     0.00     0.00    95.60     0.00     0.00     4.40
>>
>> 16:34:49  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:34:59  all     0.00     0.00     0.10     0.00     0.00    87.06     0.00     0.00    12.84
>> 16:34:59    0     0.00     0.00     0.20     0.00     0.00    79.72     0.00     0.00    20.08
>> 16:34:59    1     0.00     0.00     0.00     0.00     0.00    94.41     0.00     0.00     5.59
>>
>> 16:34:59  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:35:09  all     0.00     0.00     0.05     0.00     0.00    85.71     0.00     0.00    14.24
>> 16:35:09    0     0.00     0.00     0.10     0.00     0.00    79.42     0.00     0.00    20.48
>> 16:35:09    1     0.00     0.00     0.00     0.00     0.00    92.01     0.00     0.00     7.99
>>
>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> Average:  all     0.00     0.00     0.07     0.00     0.00    86.56     0.00     0.00    13.37
>> Average:    0     0.00     0.00     0.13     0.00     0.00    79.12     0.00     0.00    20.75
>> Average:    1     0.00     0.00     0.00     0.00     0.00    94.01     0.00     0.00     5.99
>>
>>
>> 3) System idle (no iperf)
>> root@OpenWrt:/# mpstat -P ALL 10 1
>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)
>>
>> 16:35:31  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> 16:35:41  all     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>> 16:35:41    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>> 16:35:41    1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>>
>> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
>> Average:  all     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   100.00
>> Average:    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00   10
Re: NAT performance regression caused by vlan GRO support
On 05.04.2019 06:26, Toshiaki Makita wrote:
> On 2019/04/05 5:22, Rafał Miłecki wrote:
>> On 04.04.2019 17:17, Toshiaki Makita wrote:
>>> On 19/04/04 (木) 21:57:15, Rafał Miłecki wrote:
>>>> I'd like to report a regression that goes back to 2015. I know it's damn
>>>> late, but the good thing is, the regression is still easy to reproduce,
>>>> verify & revert.
>>>>
>>>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
>>>> support for non hardware accelerated vlan") - which first hit kernel 4.2 -
>>>> NAT performance of my router dropped by 30% - 40%.
>>>>
>>>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>>>> controller and external BCM53012 switch.
>>>>
>>>> Relevant setup:
>>>> * SoC network controller is wired to the hardware switch
>>>> * Switch passes 802.1q frames with VID 1 to four LAN ports
>>>> * Switch passes 802.1q frames with VID 2 to WAN port
>>>> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
>>>> * Linux uses pfifo and "echo 2 > rps_cpus"
>>>> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
>>>> * Intel i7-2670QM laptop connected to a WAN port
>>>> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
>>>>
>>>> 1) 5.1.0-rc3
>>>> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec
>>>>
>>>> 2) 5.1.0-rc3 + rtcache patch
>>>> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec
>>>>
>>>> 3) 5.1.0-rc3 + disable GRO support
>>>> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec
>>>>
>>>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
>>>> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec
>>>
>>> Did you test it with disabling GRO by ethtool -K?
>>
>> Oh, I didn't know about such possibility! I just tested:
>> 1) Kernel with GRO support left in place (no local patch disabling it)
>> 2) ethtool -K eth0 gro off
>> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
>> break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
>>
>>> Is this the result with your reverting patch?
>>
>> Previous results were coming from kernel with patched vlan_offload_init() -
>> see diff at the end of my first e-mail.
>>
>>> It's late night in Japan so I think I will try to reproduce it tomorrow.
>
> My test results:
>
> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
> Measured TCP throughput by netperf.
>
> GRO on : 17 Gbps
> GRO off: 5 Gbps
>
> So I failed to reproduce your problem.

:( Thanks for trying & checking that!

> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
> your machine?

1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)

16:33:40  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
16:33:50  all     0.00     0.00     0.00     0.00     0.00    58.79     0.00     0.00    41.21
16:33:50    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
16:33:50    1     0.00     0.00     0.00     0.00     0.00    17.58     0.00     0.00    82.42

16:33:50  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
16:34:00  all     0.00     0.00     0.05     0.00     0.00    59.44     0.00     0.00    40.51
16:34:00    0     0.00     0.00     0.10     0.00     0.00    99.90     0.00     0.00     0.00
16:34:00    1     0.00     0.00     0.00     0.00     0.00    18.98     0.00     0.00    81.02

16:34:00  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
16:34:10  all     0.00     0.00     0.00     0.00     0.00    59.59     0.00     0.00    40.41
16:34:10    0     0.00     0.00     0.00     0.00     0.00   100.00     0.00     0.00     0.00
16:34:10    1     0.00     0.00     0.00     0.00     0.00    19.18     0.00     0.00    80.82

Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
Average:  all     0.00     0.00     0.02     0.00     0.00    59.27     0.00     0.00    40.71
Average:    0     0.00     0.00     0.03     0.00     0.00    99.97     0.00     0.00     0.00
Average:    1     0.00     0.00     0.00     0.00     0.00    18.58     0.00     0.00    81.42

2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19  _armv7l_  (2 CPU)

16:34:39  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
16:34:49  all     0.00     0.00     0.05     0.00     0.00    86.91     0.00     0.00    13.04
16:34:49    0     0.00     0.00     0.10     0.00     0.00    78.22     0.00     0.00    21.68
16:34:49    1     0.00     0.00     0.00     0.00     0.00    95.60     0.00     0.00     4.40

16:34:49  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
16:34:59  all     0.00     0.00     0.10     0.00     0.00    87.06     0.00     0.00    12.84
16:34:59    0     0.00     0.00     0.20     0.00     0.00    79.72     0.00     0.00    20.08
16:34:59    1     0.00     0.00
Re: NAT performance regression caused by vlan GRO support
On 2019/04/05 5:22, Rafał Miłecki wrote:
> On 04.04.2019 17:17, Toshiaki Makita wrote:
>> On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
>>> I'd like to report a regression that goes back to 2015. I know it's damn
>>> late, but the good thing is, the regression is still easy to reproduce,
>>> verify & revert.
>>>
>>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
>>> support for non hardware accelerated vlan") - which first hit kernel 4.2 -
>>> NAT performance of my router dropped by 30% - 40%.
>>>
>>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>>> controller and external BCM53012 switch.
>>>
>>> Relevant setup:
>>> * SoC network controller is wired to the hardware switch
>>> * Switch passes 802.1q frames with VID 1 to four LAN ports
>>> * Switch passes 802.1q frames with VID 2 to WAN port
>>> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
>>> * Linux uses pfifo and "echo 2 > rps_cpus"
>>> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
>>> * Intel i7-2670QM laptop connected to a WAN port
>>> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
>>>
>>> 1) 5.1.0-rc3
>>> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec
>>>
>>> 2) 5.1.0-rc3 + rtcache patch
>>> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec
>>>
>>> 3) 5.1.0-rc3 + disable GRO support
>>> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec
>>>
>>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
>>> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec
>>
>> Did you test it with disabling GRO by ethtool -K?
>
> Oh, I didn't know about such a possibility! I just tested:
> 1) Kernel with GRO support left in place (no local patch disabling it)
> 2) ethtool -K eth0 gro off
> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
> break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
>
>> Is this the result with your reverting patch?
>
> Previous results were coming from a kernel with patched vlan_offload_init() -
> see diff at the end of my first e-mail.
>
>> It's late night in Japan so I think I will try to reproduce it tomorrow.

My test results:

Receiving packets from eth0.10, forwarding them to eth0.20 and applying
MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
Measured TCP throughput by netperf.

GRO on : 17 Gbps
GRO off: 5 Gbps

So I failed to reproduce your problem.

Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar -u ALL
-P ALL") to check if the traffic is able to consume 100% CPU on your machine?

If CPU is 100%, perf may help us analyze your problem. If it's available, try
running below while testing:
# perf record -a -g -- sleep 5

And then run this after testing:
# perf report --no-child

--
Toshiaki Makita
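For readers without mpstat or sar on the router, the per-CPU check requested above can be approximated from two snapshots of /proc/stat. The following is only a minimal sketch: the function names are hypothetical, and it assumes the usual Linux field order on each "cpuN" line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). It reports the busy share and the softirq share, the latter being where forwarding and GRO work normally shows up.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: [jiffy counters]} for the per-CPU lines of /proc/stat."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and everything that is not "cpuN".
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


def cpu_shares(before, after):
    """Compute busy and softirq percentages per CPU between two snapshots."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # user..steal only; guest time is already accounted in user/nice.
        total = sum(delta[:8])
        if total <= 0:
            continue
        idle = delta[3] + delta[4]  # idle + iowait
        shares[cpu] = {
            "busy": 100.0 * (total - idle) / total,
            "softirq": 100.0 * delta[6] / total,
        }
    return shares


def sample(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as handle:
        before = parse_proc_stat(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_proc_stat(handle.read())
    return cpu_shares(before, after)
```

Calling sample() while the iperf run is in progress gives a rough equivalent of one mpstat interval. A softirq share close to 100% on the CPU that handles the traffic would suggest the router is CPU-bound, which is the case where perf is worth running.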
Re: NAT performance regression caused by vlan GRO support
On 04.04.2019 17:17, Toshiaki Makita wrote:
> On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
>> I'd like to report a regression that goes back to 2015. I know it's damn
>> late, but the good thing is, the regression is still easy to reproduce,
>> verify & revert.
>>
>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
>> support for non hardware accelerated vlan") - which first hit kernel 4.2 -
>> NAT performance of my router dropped by 30% - 40%.
>>
>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>> controller and external BCM53012 switch.
>>
>> Relevant setup:
>> * SoC network controller is wired to the hardware switch
>> * Switch passes 802.1q frames with VID 1 to four LAN ports
>> * Switch passes 802.1q frames with VID 2 to WAN port
>> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
>> * Linux uses pfifo and "echo 2 > rps_cpus"
>> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
>> * Intel i7-2670QM laptop connected to a WAN port
>> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
>>
>> 1) 5.1.0-rc3
>> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec
>>
>> 2) 5.1.0-rc3 + rtcache patch
>> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec
>>
>> 3) 5.1.0-rc3 + disable GRO support
>> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec
>>
>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
>> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec
>
> Did you test it with disabling GRO by ethtool -K?

Oh, I didn't know about such a possibility! I just tested:
1) Kernel with GRO support left in place (no local patch disabling it)
2) ethtool -K eth0 gro off
and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
break/fix NAT performance by just calling ethtool -K eth0 gro on/off.

> Is this the result with your reverting patch?

Previous results were coming from a kernel with patched vlan_offload_init() -
see diff at the end of my first e-mail.

> It's late night in Japan so I think I will try to reproduce it tomorrow.

Thank you!
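When flipping GRO on and off between runs as described above, it helps to confirm the state that each measurement actually ran with. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that "ethtool -k <dev>" prints a "generic-receive-offload: on|off" line, possibly followed by "[fixed]".

```python
import subprocess


def gro_enabled(features_text):
    """Parse `ethtool -k <dev>` output and report whether GRO is on."""
    for line in features_text.splitlines():
        name, _, value = line.strip().partition(":")
        if name == "generic-receive-offload":
            words = value.split()
            if not words:
                break
            # The value may be followed by "[fixed]", so only check the first word.
            return words[0] == "on"
    raise ValueError("no generic-receive-offload line found")


def set_gro_command(dev, enabled):
    """Build the ethtool invocation that switches GRO on or off for `dev`."""
    return ["ethtool", "-K", dev, "gro", "on" if enabled else "off"]


def read_features(dev):
    """Run `ethtool -k <dev>` and return its output (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-k", dev], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the router itself, gro_enabled(read_features("eth0")) can be logged next to each iperf result, and subprocess.run(set_gro_command("eth0", False), check=True) applies the "gro off" setting used in the test above (root privileges are normally required).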
Re: NAT performance regression caused by vlan GRO support
Hi Rafał,

On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
> Hello,
>
> I'd like to report a regression that goes back to 2015. I know it's damn
> late, but the good thing is, the regression is still easy to reproduce,
> verify & revert.
>
> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
> support for non hardware accelerated vlan") - which first hit kernel 4.2 -
> NAT performance of my router dropped by 30% - 40%.
>
> My hardware is BCM47094 SoC (dual core ARM) with integrated network
> controller and external BCM53012 switch.
>
> Relevant setup:
> * SoC network controller is wired to the hardware switch
> * Switch passes 802.1q frames with VID 1 to four LAN ports
> * Switch passes 802.1q frames with VID 2 to WAN port
> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
> * Linux uses pfifo and "echo 2 > rps_cpus"
> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
> * Intel i7-2670QM laptop connected to a WAN port
> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
>
> 1) 5.1.0-rc3
> [ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec
>
> 2) 5.1.0-rc3 + rtcache patch
> [ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec
>
> 3) 5.1.0-rc3 + disable GRO support
> [ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec
>
> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
> [ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec

Did you test it with disabling GRO by ethtool -K?
Is this the result with your reverting patch?

It's late night in Japan so I think I will try to reproduce it tomorrow.

Thanks.

> 5) 4.1.15 + rtcache patch
> 934 Mb/s
>
> 6) 4.3.4 + rtcache patch
> 565 Mb/s
>
> As you can see I can achieve a big performance gain by disabling/reverting
> GRO support. Getting up to 65% faster NAT makes a huge difference and
> ideally I'd like to get that with upstream Linux code.
>
> Could someone help me and check the reported commit/code, please? Is there
> any other info I can provide or anything I can test for you?
>
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
>  {
>  	unsigned int i;
>
> +	return -ENOTSUPP;
> +
>  	for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
>  		dev_add_offload(&vlan_packet_offloads[i]);
NAT performance regression caused by vlan GRO support
Hello,

I'd like to report a regression that goes back to 2015. I know it's damn
late, but the good thing is, the regression is still easy to reproduce,
verify & revert.

Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
support for non hardware accelerated vlan") - which first hit kernel 4.2 -
NAT performance of my router dropped by 30% - 40%.

My hardware is BCM47094 SoC (dual core ARM) with integrated network
controller and external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes

1) 5.1.0-rc3
[ 6] 0.0-600.0 sec 39.9 GBytes 572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[ 6] 0.0-600.0 sec 40.0 GBytes 572 Mbits/sec

3) 5.1.0-rc3 + disable GRO support
[ 6] 0.0-300.4 sec 27.5 GBytes 786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disable GRO support
[ 6] 0.0-600.0 sec 65.6 GBytes 939 Mbits/sec

5) 4.1.15 + rtcache patch
934 Mb/s

6) 4.3.4 + rtcache patch
565 Mb/s

As you can see I can achieve a big performance gain by disabling/reverting
GRO support. Getting up to 65% faster NAT makes a huge difference and
ideally I'd like to get that with upstream Linux code.

Could someone help me and check the reported commit/code, please? Is there
any other info I can provide or anything I can test for you?

--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
 {
 	unsigned int i;

+	return -ENOTSUPP;
+
 	for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
 		dev_add_offload(&vlan_packet_offloads[i]);

Attachment: .config (application/config)
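The "dropped by 30% - 40%" and "up to 65% faster" figures in the report describe the same measurements taken from two different baselines. As a small worked example (a sketch only; the dictionary keys are labels invented here, while the throughput values are the ones reported above), the relative changes can be computed as follows:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Throughput in Mb/s, taken from the results listed in the report.
results = {
    "4.1.15 + rtcache": 934,
    "4.3.4 + rtcache": 565,
    "5.1.0-rc3 + rtcache": 572,
    "5.1.0-rc3 + rtcache + GRO disabled": 939,
}

# Regression seen when moving from 4.1.15 to 4.3.4 (vlan GRO introduced in 4.2).
drop = percent_change(results["4.1.15 + rtcache"], results["4.3.4 + rtcache"])

# Gain seen on 5.1.0-rc3 when vlan GRO support is disabled.
gain = percent_change(
    results["5.1.0-rc3 + rtcache"],
    results["5.1.0-rc3 + rtcache + GRO disabled"],
)

print(f"4.1.15 -> 4.3.4: {drop:.1f}%")
print(f"5.1.0-rc3, GRO disabled: {gain:+.1f}%")
```

With these numbers the 4.1.15 to 4.3.4 step is a drop of about 39.5%, while disabling GRO on 5.1.0-rc3 is a gain of about 64.2%. Read in the other direction, 565 Mb/s to 934 Mb/s is roughly +65.3%, which matches the "up to 65%" wording.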