RE: NAT performance regression caused by vlan GRO support

2019-04-08 Thread David Laight
From: Rafal Milecki
> Sent: 07 April 2019 12:55
...
> If not, maybe we really need to think about some good & clever condition for
> disabling GRO by default on hw without checksum offloading.

Maybe GRO could assume the checksums are valid so the checksum
would only be verified when the packet is delivered locally.

If the packet is forwarded then, provided the same packet
boundaries are used, the original checksums (maybe modified
by NAT) can be used.

No idea how easy this might be :-)

David



Re: NAT performance regression caused by vlan GRO support

2019-04-07 Thread Rafał Miłecki

Now I have some questions regarding possible optimizations. Note I'm not too
familiar with the net subsystem, so I may have some wrong ideas.

On 07.04.2019 13:53, Rafał Miłecki wrote:

On 04.04.2019 14:57, Rafał Miłecki wrote:

Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
performance of my router dropped by 30% - 40%.


I'll try to provide some summary for this issue. I'll focus on TCP traffic as
that's what I happened to test.

Basically all slowdowns are related to csum_partial(). Calculating checksums
has a significant impact on NAT performance on devices with less powerful CPUs.
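For reference, csum_partial() computes the 16-bit one's-complement Internet
checksum (RFC 1071). A minimal userspace sketch of the per-byte work involved
(not the kernel's optimized implementation, just the algorithm it performs):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement (Internet) checksum over a buffer of even
 * length: sum 16-bit words, fold carries back in, complement. */
static uint16_t inet_csum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    while (sum >> 16)                       /* end-around carry */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A buffer that already contains its own checksum sums to zero when verified,
which is the property the validation paths below rely on. The point is that
the cost is linear in packet bytes, so it dominates on CPUs without hardware
offload.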

**

GRO disabled

Without GRO a csum_partial() is used only when validating TCP packets in the
nf_conntrack_tcp_packet() (known as tcp_packet() in kernels older than 5.1).

Simplified forward trace for that case:
nf_conntrack_in
  nf_conntrack_tcp_packet
    tcp_error
      if (state->net->ct.sysctl_checksum)
        nf_checksum
          nf_ip_checksum
            __skb_checksum_complete

That validation can be disabled using the nf_conntrack_checksum sysctl, and it
bumps NAT speed for me from 666 Mb/s to 940 Mb/s (+41%).

**

GRO enabled

First of all GRO also includes TCP validation that requires calculating a
checksum.

Simplified forward trace for that case:
vlan_gro_receive
  call_gro_receive
    inet_gro_receive
      indirect_call_gro_receive
        tcp4_gro_receive
          skb_gro_checksum_validate
          tcp_gro_receive

*If* we had a way to disable that validation it *would* result in bumping NAT
speed for me from 577 Mb/s to 825 Mb/s (+43%).


Could we have tcp4_gro_receive() behave similarly to tcp_error() and make it
respect the nf_conntrack_checksum sysctl value?

Could we simply add something like:
if (dev_net(skb->dev)->ct.sysctl_checksum)
to it (to additionally protect a skb_gro_checksum_validate() call)?
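A minimal sketch of that idea (untested, assuming CONFIG_NF_CONNTRACK is set;
reaching into net->ct from the inet offload path may well draw a layering
objection) on top of the current tcp4_gro_receive():

```c
static struct sk_buff *tcp4_gro_receive(struct list_head *head,
					struct sk_buff *skb)
{
	/* Hypothetical: skip the csum_partial()-backed validation when the
	 * nf_conntrack_checksum sysctl is 0, mirroring what tcp_error()
	 * already does on the conntrack side. */
	if (!NAPI_GRO_CB(skb)->flush &&
	    dev_net(skb->dev)->ct.sysctl_checksum &&
	    skb_gro_checksum_validate(skb, IPPROTO_TCP,
				      inet_gro_compute_pseudo)) {
		NAPI_GRO_CB(skb)->flush = 1;
		return NULL;
	}

	return tcp_gro_receive(head, skb);
}
```

With the sysctl at 0, GRO would merge segments without ever touching the
checksum; invalid packets would then be caught (or not) exactly as they are
today with conntrack checksumming disabled.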



Secondly, using GRO means we need to calculate a checksum before transmitting
packets (this applies to devices without HW checksum offloading). I think it's
related to packet merging in skb_gro_receive() and the subsequent setting of
CHECKSUM_PARTIAL:

vlan_gro_complete
  inet_gro_complete
    tcp4_gro_complete
      tcp_gro_complete
        skb->ip_summed = CHECKSUM_PARTIAL;

That results in bgmac calculating the checksum from scratch; take a look at
bgmac_dma_tx_add(), which does:

if (skb->ip_summed == CHECKSUM_PARTIAL)
 skb_checksum_help(skb);

Performing that whole checksum calculation means GRO will always slow down NAT
for me on the BCM47094 SoC with its not-so-powerful ARM CPUs.


Is it possible to avoid CHECKSUM_PARTIAL & skb_checksum_help(), which has to
calculate the whole checksum? It's definitely possible to *update* a checksum
after simple packet changes (e.g. amending an IP address or port). Would it be
possible to use a similar method when dealing with packets merged by GRO?
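For comparison, the incremental update NAT already performs for the fields it
rewrites is cheap and local (RFC 1624: HC' = ~(~HC + ~m + m')). A small
userspace sketch, assuming a big-endian 16-bit view of the packet bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full one's-complement checksum, as a reference to compare against. */
static uint16_t inet_csum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* RFC 1624 incremental update: recompute the checksum after one 16-bit
 * field changes from old_field to new_field, touching nothing else. */
static uint16_t inet_csum_update(uint16_t csum, uint16_t old_field,
                                 uint16_t new_field)
{
    uint32_t sum = (uint16_t)~csum;

    sum += (uint16_t)~old_field;    /* subtract old value (1's complement) */
    sum += new_field;               /* add new value */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

That is O(1) per rewritten field, whereas skb_checksum_help() after GRO has to
walk the whole (now much larger) merged skb.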

If not, maybe we really need to think about some good & clever condition for
disabling GRO by default on hw without checksum offloading.




Re: NAT performance regression caused by vlan GRO support

2019-04-05 Thread Eric Dumazet



On 04/05/2019 03:51 AM, Florian Westphal wrote:
> Toke Høiland-Jørgensen  wrote:
>> As a first approximation, maybe just:
>>
>> if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>>   disable_gro();
> 
> I don't think it's a good idea.  For the local delivery case, there is no
> way to avoid the checksum cost, so we might as well have GRO enabled.
> 

We might add a sysctl or some way to tell the GRO layer:

Do not attempt checksum validation if forwarding is enabled on the host.

Basically, GRO would validate checksums only if the NIC has provided checksum
offload.



Re: NAT performance regression caused by vlan GRO support

2019-04-05 Thread Florian Westphal
Toke Høiland-Jørgensen  wrote:
> As a first approximation, maybe just:
> 
> if (!has_hardware_cksum_offload(netdev) && link_rate(netdev) <= 1Gbps)
>   disable_gro();

I don't think it's a good idea.  For the local delivery case, there is no
way to avoid the checksum cost, so we might as well have GRO enabled.



Re: NAT performance regression caused by vlan GRO support

2019-04-05 Thread Rafał Miłecki

On 05.04.2019 10:12, Rafał Miłecki wrote:

On 05.04.2019 09:58, Toshiaki Makita wrote:

On 2019/04/05 16:14, Felix Fietkau wrote:

On 2019-04-05 09:11, Rafał Miłecki wrote:

I guess it's GRO + csum_partial() that's to blame for this performance drop.

Maybe csum_partial() is very fast on your powerful machine and a few extra calls
don't make a difference? I can imagine it affecting a much slower home router
with ARM cores.

Most high performance Ethernet devices implement hardware checksum
offload, which completely gets rid of this overhead.
Unfortunately, the BCM53xx/47xx Ethernet MAC doesn't have this, which is
why you're getting such crappy performance.


Hmm... now I disabled rx checksum and tried the test again, and indeed I
see csum_partial from GRO path. But I also see csum_partial even without
GRO from nf_conntrack_in -> tcp_packet -> __skb_checksum_complete.
Probably Rafał disabled nf_conntrack_checksum sysctl knob?

But anyway even with disabling rx csum offload my machine has better
performance with GRO. I'm sure in some cases GRO should be disabled, but
I guess it's difficult to determine whether we should disable GRO or not
automatically when csum offload is not available.


A few testing results:

1) ethtool -K eth0 gro off; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  6.57 GBytes   940 Mbits/sec

2) ethtool -K eth0 gro off; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.65 GBytes   666 Mbits/sec


For this case (GRO off and nf_conntrack_checksum enabled) I can confirm I see
csum_partial() in the perf output. It's taking 13,14% instead of 25,46% (as when
using GRO) though.

Samples: 38K of event 'cycles', Event count (approx.): 12209908413
  Overhead  Command  Shared Object   Symbol
+   13,14%  ksoftirqd/1  [kernel.kallsyms]   [k] csum_partial
+   10,16%  swapper  [kernel.kallsyms]   [k] v7_dma_inv_range
+6,36%  swapper  [kernel.kallsyms]   [k] l2c210_inv_range
+4,89%  swapper  [kernel.kallsyms]   [k] __irqentry_text_end
+4,12%  ksoftirqd/1  [kernel.kallsyms]   [k] v7_dma_clean_range
+3,78%  swapper  [kernel.kallsyms]   [k] bcma_host_soc_read32
+2,76%  swapper  [kernel.kallsyms]   [k] arch_cpu_idle
+2,45%  ksoftirqd/1  [kernel.kallsyms]   [k] __netif_receive_skb_core
+2,37%  ksoftirqd/1  [kernel.kallsyms]   [k] l2c210_clean_range
+1,76%  ksoftirqd/1  [kernel.kallsyms]   [k] bgmac_start_xmit
+1,66%  swapper  [kernel.kallsyms]   [k] bgmac_poll
+1,55%  ksoftirqd/1  [kernel.kallsyms]   [k] __dev_queue_xmit
+1,11%  ksoftirqd/1  [kernel.kallsyms]   [k] skb_vlan_untag



3) ethtool -K eth0 gro on; echo 0 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.02 GBytes   575 Mbits/sec

4) ethtool -K eth0 gro on; echo 1 > /proc/sys/net/netfilter/nf_conntrack_checksum
[  6]  0.0-60.0 sec  4.04 GBytes   579 Mbits/sec





Re: NAT performance regression caused by vlan GRO support

2019-04-05 Thread Rafał Miłecki

On 05.04.2019 07:48, Rafał Miłecki wrote:

On 05.04.2019 06:26, Toshiaki Makita wrote:

My test results:

Receiving packets from eth0.10, forwarding them to eth0.20 and applying
MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
Measured TCP throughput by netperf.

GRO on : 17 Gbps
GRO off:  5 Gbps

So I failed to reproduce your problem.


:( Thanks for trying & checking that!



Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
-u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
your machine?


1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)

16:33:40 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:33:50 all    0.00    0.00    0.00    0.00    0.00   58.79    0.00    0.00   41.21
16:33:50   0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
16:33:50   1    0.00    0.00    0.00    0.00    0.00   17.58    0.00    0.00   82.42

16:33:50 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:00 all    0.00    0.00    0.05    0.00    0.00   59.44    0.00    0.00   40.51
16:34:00   0    0.00    0.00    0.10    0.00    0.00   99.90    0.00    0.00    0.00
16:34:00   1    0.00    0.00    0.00    0.00    0.00   18.98    0.00    0.00   81.02

16:34:00 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:10 all    0.00    0.00    0.00    0.00    0.00   59.59    0.00    0.00   40.41
16:34:10   0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
16:34:10   1    0.00    0.00    0.00    0.00    0.00   19.18    0.00    0.00   80.82

Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average: all    0.00    0.00    0.02    0.00    0.00   59.27    0.00    0.00   40.71
Average:   0    0.00    0.00    0.03    0.00    0.00   99.97    0.00    0.00    0.00
Average:   1    0.00    0.00    0.00    0.00    0.00   18.58    0.00    0.00   81.42


2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)

16:34:39 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:49 all    0.00    0.00    0.05    0.00    0.00   86.91    0.00    0.00   13.04
16:34:49   0    0.00    0.00    0.10    0.00    0.00   78.22    0.00    0.00   21.68
16:34:49   1    0.00    0.00    0.00    0.00    0.00   95.60    0.00    0.00    4.40

16:34:49 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:59 all    0.00    0.00    0.10    0.00    0.00   87.06    0.00    0.00   12.84
16:34:59   0    0.00    0.00    0.20    0.00    0.00   79.72    0.00    0.00   20.08
16:34:59   1    0.00    0.00    0.00    0.00    0.00   94.41    0.00    0.00    5.59

16:34:59 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:35:09 all    0.00    0.00    0.05    0.00    0.00   85.71    0.00    0.00   14.24
16:35:09   0    0.00    0.00    0.10    0.00    0.00   79.42    0.00    0.00   20.48
16:35:09   1    0.00    0.00    0.00    0.00    0.00   92.01    0.00    0.00    7.99

Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average: all    0.00    0.00    0.07    0.00    0.00   86.56    0.00    0.00   13.37
Average:   0    0.00    0.00    0.13    0.00    0.00   79.12    0.00    0.00   20.75
Average:   1    0.00    0.00    0.00    0.00    0.00   94.01    0.00    0.00    5.99


3) System idle (no iperf)
root@OpenWrt:/# mpstat -P ALL 10 1
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)

16:35:31 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:35:41 all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
16:35:41   0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
16:35:41   1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average: all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00



If CPU is 100%, perf may help us analyze your problem. If it's
available, try running below while testing:
# perf record -a -g -- sleep 5

And then run this after testing:
# perf report --no-child


I can see my CPU 0 is fully loaded when using "gro on". I'll try perf now.


I guess it's GRO + csum_partial() to be


Re: NAT performance regression caused by vlan GRO support

2019-04-04 Thread Rafał Miłecki

On 05.04.2019 06:26, Toshiaki Makita wrote:

On 2019/04/05 5:22, Rafał Miłecki wrote:

On 04.04.2019 17:17, Toshiaki Makita wrote:

On 19/04/04 (木) 21:57:15, Rafał Miłecki wrote:

I'd like to report a regression that goes back to 2015. I know it's damn late,
but the good thing is, the regression is still easy to reproduce, verify &
revert.

Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
performance of my router dropped by 30% - 40%.

My hardware is BCM47094 SoC (dual core ARM) with integrated network controller
and external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes

1) 5.1.0-rc3
[  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec

3) 5.1.0-rc3 + disable GRO support
[  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disable GRO support
[  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec


Did you test it with disabling GRO by ethtool -K?


Oh, I didn't know about such a possibility! I just tested:
1) Kernel with GRO support left in place (no local patch disabling it)
2) ethtool -K eth0 gro off
and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
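
For anyone wanting to repeat that toggle test, a minimal sketch of the cycle
(interface name taken from this thread; the iperf server address is a
placeholder to adjust):

```shell
#!/bin/sh
# Compare NAT throughput with GRO enabled vs. disabled on the trunk
# interface. Assumes an iperf server is already listening on the host
# behind the WAN port (192.0.2.1 is a placeholder address).
for state in on off; do
    ethtool -K eth0 gro $state
    echo "GRO $state:"
    iperf -c 192.0.2.1 -t 60 | tail -n 1
done
```
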



Is this the result with your reverting patch?


Previous results were coming from a kernel with a patched vlan_offload_init() -
see the diff at the end of my first e-mail.



It's late night in Japan so I think I will try to reproduce it tomorrow.


My test results:

Receiving packets from eth0.10, forwarding them to eth0.20 and applying
MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
Measured TCP throughput by netperf.

GRO on : 17 Gbps
GRO off:  5 Gbps

So I failed to reproduce your problem.


:( Thanks for trying & checking that!



Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
-u ALL -P ALL") to see if the traffic is able to consume 100% CPU on
your machine?


1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)

16:33:40 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:33:50 all    0.00    0.00    0.00    0.00    0.00   58.79    0.00    0.00   41.21
16:33:50   0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
16:33:50   1    0.00    0.00    0.00    0.00    0.00   17.58    0.00    0.00   82.42

16:33:50 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:00 all    0.00    0.00    0.05    0.00    0.00   59.44    0.00    0.00   40.51
16:34:00   0    0.00    0.00    0.10    0.00    0.00   99.90    0.00    0.00    0.00
16:34:00   1    0.00    0.00    0.00    0.00    0.00   18.98    0.00    0.00   81.02

16:34:00 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:10 all    0.00    0.00    0.00    0.00    0.00   59.59    0.00    0.00   40.41
16:34:10   0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
16:34:10   1    0.00    0.00    0.00    0.00    0.00   19.18    0.00    0.00   80.82

Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average: all    0.00    0.00    0.02    0.00    0.00   59.27    0.00    0.00   40.71
Average:   0    0.00    0.00    0.03    0.00    0.00   99.97    0.00    0.00    0.00
Average:   1    0.00    0.00    0.00    0.00    0.00   18.58    0.00    0.00   81.42


2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
root@OpenWrt:/# mpstat -P ALL 10 3
Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)

16:34:39 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:49 all    0.00    0.00    0.05    0.00    0.00   86.91    0.00    0.00   13.04
16:34:49   0    0.00    0.00    0.10    0.00    0.00   78.22    0.00    0.00   21.68
16:34:49   1    0.00    0.00    0.00    0.00    0.00   95.60    0.00    0.00    4.40

16:34:49 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
16:34:59 all    0.00    0.00    0.10    0.00    0.00   87.06    0.00    0.00   12.84
16:34:59   0    0.00    0.00    0.20    0.00    0.00   79.72    0.00    0.00   20.08
16:34:59   1    0.00    0.00

Re: NAT performance regression caused by vlan GRO support

2019-04-04 Thread Toshiaki Makita
On 2019/04/05 5:22, Rafał Miłecki wrote:
> On 04.04.2019 17:17, Toshiaki Makita wrote:
>> On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:
>>> I'd like to report a regression that goes back to 2015. I know it's damn
>>> late, but the good thing is, the regression is still easy to reproduce,
>>> verify & revert.
>>>
>>> Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO
>>> support for non hardware accelerated vlan") - which first hit kernel 4.2 -
>>> NAT performance of my router dropped by 30% - 40%.
>>>
>>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>>> controller and external BCM53012 switch.
>>>
>>> Relevant setup:
>>> * SoC network controller is wired to the hardware switch
>>> * Switch passes 802.1q frames with VID 1 to four LAN ports
>>> * Switch passes 802.1q frames with VID 2 to WAN port
>>> * Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
>>> * Linux uses pfifo and "echo 2 > rps_cpus"
>>> * Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
>>> * Intel i7-2670QM laptop connected to a WAN port
>>> * Speed of LAN to WAN measured using iperf & TCP over 10 minutes
>>>
>>> 1) 5.1.0-rc3
>>> [  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec
>>>
>>> 2) 5.1.0-rc3 + rtcache patch
>>> [  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec
>>>
>>> 3) 5.1.0-rc3 + disable GRO support
>>> [  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec
>>>
>>> 4) 5.1.0-rc3 + rtcache patch + disable GRO support
>>> [  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec
>>
>> Did you test it with disabling GRO by ethtool -K?
> 
> Oh, I didn't know about such a possibility! I just tested:
> 1) Kernel with GRO support left in place (no local patch disabling it)
> 2) ethtool -K eth0 gro off
> and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
> break/fix NAT performance by just calling ethtool -K eth0 gro on/off.
> 
> 
>> Is this the result with your reverting patch?
> 
> Previous results were coming from a kernel with a patched
> vlan_offload_init() - see the diff at the end of my first e-mail.
> 
> 
>> It's late night in Japan so I think I will try to reproduce it tomorrow.

My test results:

Receiving packets from eth0.10, forwarding them to eth0.20 and applying
MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
Measured TCP throughput by netperf.

GRO on : 17 Gbps
GRO off:  5 Gbps

So I failed to reproduce your problem.
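
For reference, the forwarding/MASQUERADE setup described above can be
reproduced with something like the following (a sketch; VLAN interface names
are taken from the message, everything else is an assumption to adapt):

```shell
#!/bin/sh
# Recreate the test configuration: forward between two VLAN
# sub-interfaces, NAT on the egress side, and disable hardware VLAN RX
# acceleration so the software vlan_gro_receive() path is exercised.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0.20 -j MASQUERADE
ethtool -K eth0 rxvlan off
```
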

Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
-u ALL -P ALL") to see if the traffic is able to consume 100% CPU on
your machine?

If CPU is 100%, perf may help us analyze your problem. If it's
available, try running below while testing:
# perf record -a -g -- sleep 5

And then run this after testing:
# perf report --no-child

-- 
Toshiaki Makita



Re: NAT performance regression caused by vlan GRO support

2019-04-04 Thread Rafał Miłecki

On 04.04.2019 17:17, Toshiaki Makita wrote:

On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:

I'd like to report a regression that goes back to 2015. I know it's damn
late, but the good thing is, the regression is still easy to reproduce, verify &
revert.

Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
performance of my router dropped by 30% - 40%.

My hardware is BCM47094 SoC (dual core ARM) with integrated network controller
and external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes

1) 5.1.0-rc3
[  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec

3) 5.1.0-rc3 + disable GRO support
[  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disable GRO support
[  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec


Did you test it with disabling GRO by ethtool -K?


Oh, I didn't know about such a possibility! I just tested:
1) Kernel with GRO support left in place (no local patch disabling it)
2) ethtool -K eth0 gro off
and it bumped my NAT performance from 576 Mb/s to 939 Mb/s. I can reliably
break/fix NAT performance by just calling ethtool -K eth0 gro on/off.



Is this the result with your reverting patch?


Previous results were coming from a kernel with a patched vlan_offload_init() -
see the diff at the end of my first e-mail.



It's late night in Japan so I think I will try to reproduce it tomorrow.


Thank you!


Re: NAT performance regression caused by vlan GRO support

2019-04-04 Thread Toshiaki Makita

Hi Rafał,

On 19/04/04 (Thu) 21:57:15, Rafał Miłecki wrote:

Hello,

I'd like to report a regression that goes back to 2015. I know it's damn
late, but the good thing is, the regression is still easy to reproduce,
verify & revert.

Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
performance of my router dropped by 30% - 40%.

My hardware is BCM47094 SoC (dual core ARM) with integrated network controller
and external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes

1) 5.1.0-rc3
[  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec

3) 5.1.0-rc3 + disable GRO support
[  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disable GRO support
[  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec


Did you test it with disabling GRO by ethtool -K?
Is this the result with your reverting patch?

It's late night in Japan so I think I will try to reproduce it tomorrow.

Thanks.



5) 4.1.15 + rtcache patch
934 Mb/s

6) 4.3.4 + rtcache patch
565 Mb/s

As you can see I can achieve a big performance gain by disabling/reverting
GRO support. Getting up to 65% faster NAT makes a huge difference and
ideally I'd like to get that with upstream Linux code.

Could someone help me and check the reported commit/code, please? Is there
any other info I can provide or anything I can test for you?


--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
  {
  unsigned int i;

+    return -ENOTSUPP;
+
  for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
  dev_add_offload(&vlan_packet_offloads[i]);


NAT performance regression caused by vlan GRO support

2019-04-04 Thread Rafał Miłecki

Hello,

I'd like to report a regression that goes back to 2015. I know it's damn
late, but the good thing is, the regression is still easy to reproduce, verify &
revert.

Long story short, starting with the commit 66e5133f19e9 ("vlan: Add GRO support
for non hardware accelerated vlan") - which first hit kernel 4.2 - NAT
performance of my router dropped by 30% - 40%.

My hardware is BCM47094 SoC (dual core ARM) with integrated network controller
and external BCM53012 switch.

Relevant setup:
* SoC network controller is wired to the hardware switch
* Switch passes 802.1q frames with VID 1 to four LAN ports
* Switch passes 802.1q frames with VID 2 to WAN port
* Linux does NAT for LAN (eth0.1) to WAN (eth0.2)
* Linux uses pfifo and "echo 2 > rps_cpus"
* Ryzen 5 PRO 2500U (x86_64) laptop connected to a LAN port
* Intel i7-2670QM laptop connected to a WAN port
* Speed of LAN to WAN measured using iperf & TCP over 10 minutes
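
The RPS setting mentioned above ("echo 2 > rps_cpus") is the per-RX-queue
sysfs mask; spelled out, it might look like this (a sketch - the standard
sysfs path is shown, but the queue directory name on this SoC's driver is an
assumption):

```shell
# Steer receive packet steering (RPS) for the NIC's first RX queue to
# CPU 1 (bitmask 0x2 = CPU 1 on this dual-core system).
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```
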

1) 5.1.0-rc3
[  6]  0.0-600.0 sec  39.9 GBytes   572 Mbits/sec

2) 5.1.0-rc3 + rtcache patch
[  6]  0.0-600.0 sec  40.0 GBytes   572 Mbits/sec

3) 5.1.0-rc3 + disable GRO support
[  6]  0.0-300.4 sec  27.5 GBytes   786 Mbits/sec

4) 5.1.0-rc3 + rtcache patch + disable GRO support
[  6]  0.0-600.0 sec  65.6 GBytes   939 Mbits/sec

5) 4.1.15 + rtcache patch
934 Mb/s

6) 4.3.4 + rtcache patch
565 Mb/s

As you can see I can achieve a big performance gain by disabling/reverting
GRO support. Getting up to 65% faster NAT makes a huge difference and ideally
I'd like to get that with upstream Linux code.

Could someone help me and check the reported commit/code, please? Is there
any other info I can provide or anything I can test for you?


--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -545,6 +545,8 @@ static int __init vlan_offload_init(void)
 {
unsigned int i;

+   return -ENOTSUPP;
+
for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
dev_add_offload(&vlan_packet_offloads[i]);



.config
Description: application/config