Re: kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps

2018-01-28 Thread Eric Dumazet
On Sun, 2018-01-28 at 19:26 +0100, Paweł Staszewski wrote:
> 
> On 27.01.2018 at 23:23, Paweł Staszewski wrote:
> > Hi
> > 
> > 
> > Today I ran some real-life traffic tests with kernel 4.15.0-rc9,
> >
> > but when traffic reaches 50Gbit/s and about 6Mpps, CPU load rises quickly
> > from 48% to 100% on all CPU cores.
> >
> > Here is a graph showing how CPU load rises as the packet rate
> > increases.
> > 
> > 
> > https://ibb.co/mhD5ob
> > 
> > 
> > here is perf record from that time:
> > 
> > https://pastebin.com/3zqG1rvE
> > 
> > 
> > There are 8x 10G ixgbe 82599 interfaces teamed with teamd.
> >
> > No traffic queueing - only pfifo_fast on all interfaces.
> >
> > No NAT and no iptables rules other than INPUT (about 30 rules)
> >
> > All NICs have the same ethtool settings:
> > 
> > ethtool -k eth0
> > Features for eth0:
> > Cannot get device udp-fragmentation-offload settings: Operation not 
> > supported
> > rx-checksumming: on
> > tx-checksumming: on
> >     tx-checksum-ipv4: off [fixed]
> >     tx-checksum-ip-generic: on
> >     tx-checksum-ipv6: off [fixed]
> >     tx-checksum-fcoe-crc: off [fixed]
> >     tx-checksum-sctp: on
> > scatter-gather: on
> >     tx-scatter-gather: on
> >     tx-scatter-gather-fraglist: off [fixed]
> > tcp-segmentation-offload: on
> >     tx-tcp-segmentation: on
> >     tx-tcp-ecn-segmentation: off [fixed]
> >     tx-tcp-mangleid-segmentation: off
> >     tx-tcp6-segmentation: on
> > udp-fragmentation-offload: off
> > generic-segmentation-offload: on
> > generic-receive-offload: on
> > large-receive-offload: off
> > rx-vlan-offload: on
> > tx-vlan-offload: on
> > ntuple-filters: on
> > receive-hashing: on
> > highdma: on [fixed]
> > rx-vlan-filter: on
> > vlan-challenged: off [fixed]
> > tx-lockless: off [fixed]
> > netns-local: off [fixed]
> > tx-gso-robust: off [fixed]
> > tx-fcoe-segmentation: off [fixed]
> > tx-gre-segmentation: on
> > tx-gre-csum-segmentation: on
> > tx-ipxip4-segmentation: on
> > tx-ipxip6-segmentation: on
> > tx-udp_tnl-segmentation: on
> > tx-udp_tnl-csum-segmentation: on
> > tx-gso-partial: on
> > tx-sctp-segmentation: off [fixed]
> > tx-esp-segmentation: off [fixed]
> > fcoe-mtu: off [fixed]
> > tx-nocache-copy: off
> > loopback: off [fixed]
> > rx-fcs: off [fixed]
> > rx-all: off
> > tx-vlan-stag-hw-insert: off [fixed]
> > rx-vlan-stag-hw-parse: off [fixed]
> > rx-vlan-stag-filter: off [fixed]
> > l2-fwd-offload: off
> > hw-tc-offload: off
> > esp-hw-offload: off [fixed]
> > esp-tx-csum-hw-offload: off [fixed]
> > rx-udp_tunnel-port-offload: on
> > 
> > 
> > ethtool -g eth0
> > Ring parameters for eth0:
> > Pre-set maximums:
> > RX: 4096
> > RX Mini:    0
> > RX Jumbo:   0
> > TX: 4096
> > Current hardware settings:
> > RX: 4096
> > RX Mini:    0
> > RX Jumbo:   0
> > TX: 2048
> > 
> > 
> > ethtool -c eth0
> > Coalesce parameters for eth0:
> > Adaptive RX: off  TX: off
> > stats-block-usecs: 0
> > sample-interval: 0
> > pkt-rate-low: 0
> > pkt-rate-high: 0
> > 
> > rx-usecs: 512
> > rx-frames: 0
> > rx-usecs-irq: 0
> > rx-frames-irq: 0
> > 
> > tx-usecs: 0
> > tx-frames: 0
> > tx-usecs-irq: 0
> > tx-frames-irq: 0
> > 
> > rx-usecs-low: 0
> > rx-frame-low: 0
> > tx-usecs-low: 0
> > tx-frame-low: 0
> > 
> > rx-usecs-high: 0
> > rx-frame-high: 0
> > tx-usecs-high: 0
> > tx-frame-high: 0
> > 
> > 
> > 
> > 
> 
> 
> 
> Perf top for kernel 4.15.0-rc9 below (all 40 cores at 100% CPU load with
> 6.3Mpps)
> 
>  20.96%  [kernel]    [k] queued_spin_lock_slowpath
>   5.51%  [kernel]    [k] ixgbe_poll
>   5.49%  [kernel]    [k] ixgbe_xmit_frame_ring
>   4.39%  [kernel]    [k] do_raw_spin_lock
>   4.29%  [kernel]    [k] sch_direct_xmit
>   4.11%  [kernel]    [k] fib_table_lookup
>   3.11%  [team_mode_roundrobin]  [k] rr_transmit
>   2.71%  [kernel]    [k] __dev_queue_xmit
>   2.62%  [kernel]    [k] __ptr_ring_peek
>   2.39%  [kernel]    [k] skb_release_data
>   2.18%  [kernel]    [k] dev_gro_receive
>   1.75%  [kernel]    [k] __qdisc_run
>   1.67%  [kernel]    [k] pfifo_fast_enqueue
>   1.57%  [kernel]    [k] netdev_pick_tx
>   1.56%  [kernel]    [k] page_frag_free
>   1.48%  [kernel]    [k] ip_finish_output2
>   1.38%  [kernel]    [k] __slab_free
>   1.36%  [kernel]    [k] skb_unref
>   1.34%  [kernel]    [k] ixgbe_maybe_stop_tx
>   1.30%  [kernel]    [k] vlan_do_receive
>   1.28%  [kernel]    [k] pfifo_fast_dequeue
>   1.23%  [kernel]    [k] virt_to_head_page
> 
> 
> 
> Same configuration with kernel 4.15.0-rc3 (50% CPU load on all 40 cores with
> 6.3Mpps)
> 
>   7.81%  [kernel]    [k] 

Re: kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps

2018-01-28 Thread Paweł Staszewski



On 27.01.2018 at 23:23, Paweł Staszewski wrote:

Hi


Today I ran some real-life traffic tests with kernel 4.15.0-rc9,

but when traffic reaches 50Gbit/s and about 6Mpps, CPU load rises quickly
from 48% to 100% on all CPU cores.


Here is a graph showing how CPU load rises as the packet rate
increases.



https://ibb.co/mhD5ob


here is perf record from that time:

https://pastebin.com/3zqG1rvE


There are 8x 10G ixgbe 82599 interfaces teamed with teamd.

No traffic queueing - only pfifo_fast on all interfaces.

No NAT and no iptables rules other than INPUT (about 30 rules)

All NICs have the same ethtool settings:

ethtool -k eth0
Features for eth0:
Cannot get device udp-fragmentation-offload settings: Operation not 
supported

rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: on
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on


ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini:    0
RX Jumbo:   0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini:    0
RX Jumbo:   0
TX: 2048


ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 512
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0








Perf top for kernel 4.15.0-rc9 below (all 40 cores at 100% CPU load with
6.3Mpps)


20.96%  [kernel]    [k] queued_spin_lock_slowpath
 5.51%  [kernel]    [k] ixgbe_poll
 5.49%  [kernel]    [k] ixgbe_xmit_frame_ring
 4.39%  [kernel]    [k] do_raw_spin_lock
 4.29%  [kernel]    [k] sch_direct_xmit
 4.11%  [kernel]    [k] fib_table_lookup
 3.11%  [team_mode_roundrobin]  [k] rr_transmit
 2.71%  [kernel]    [k] __dev_queue_xmit
 2.62%  [kernel]    [k] __ptr_ring_peek
 2.39%  [kernel]    [k] skb_release_data
 2.18%  [kernel]    [k] dev_gro_receive
 1.75%  [kernel]    [k] __qdisc_run
 1.67%  [kernel]    [k] pfifo_fast_enqueue
 1.57%  [kernel]    [k] netdev_pick_tx
 1.56%  [kernel]    [k] page_frag_free
 1.48%  [kernel]    [k] ip_finish_output2
 1.38%  [kernel]    [k] __slab_free
 1.36%  [kernel]    [k] skb_unref
 1.34%  [kernel]    [k] ixgbe_maybe_stop_tx
 1.30%  [kernel]    [k] vlan_do_receive
 1.28%  [kernel]    [k] pfifo_fast_dequeue
 1.23%  [kernel]    [k] virt_to_head_page
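
To put the listing above in per-packet terms, here is a back-of-envelope sketch in Python. It assumes the load is spread evenly over the 40 cores and that the perf top percentages are shares of total CPU time across all cores; neither is stated in the message, and the function name is only illustrative.

```python
def cpu_us_per_packet(cores: int, load: float, packets_per_second: float) -> float:
    """CPU time in microseconds spent per packet, summed over all cores.

    load is the average utilisation of each core as a fraction (1.0 == 100%).
    """
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return cores * load / packets_per_second * 1e6


# Figures from the message: 40 cores at 100% load with 6.3 Mpps.
total_us = cpu_us_per_packet(cores=40, load=1.0, packets_per_second=6.3e6)

# Share attributed to queued_spin_lock_slowpath in the perf top listing.
spin_us = total_us * 20.96 / 100

if __name__ == "__main__":
    print(f"CPU time per packet:         {total_us:.2f} us")
    print(f"of which spin-lock slowpath: {spin_us:.2f} us")
```

Under those assumptions, each packet costs a little over 6 us of CPU time, of which roughly 1.3 us is spent in the spin-lock slow path.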



Same configuration with kernel 4.15.0-rc3 (50% CPU load on all 40 cores with
6.3Mpps)


 7.81%  [kernel]    [k] ixgbe_xmit_frame_ring
 7.61%  [kernel]    [k] ixgbe_poll
 7.09%  [kernel]    [k] do_raw_spin_lock
 5.63%  [kernel]    [k] fib_table_lookup
 5.19%  [kernel]    [k] __dev_queue_xmit
 4.38%  [team_mode_roundrobin]  [k] rr_transmit
 3.10%  [kernel]    [k] netdev_pick_tx
 2.79%  [kernel]    [k] skb_release_data
 2.34%  [kernel]    [k] dev_gro_receive
 1.99%  [kernel]    [k] page_frag_free
 1.96%  [kernel]    [k] skb_unref
 1.92%  [kernel]    [k] virt_to_head_page
 1.90%  [kernel]    [k] ixgbe_maybe_stop_tx
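
One way to read the two listings side by side is to diff them by symbol. The Python sketch below does that with the numbers pasted above. The helper names are hypothetical. Because the rc3 listing in this message stops at 1.90%, rc9 entries below that value are skipped: their absence from the rc3 list may only reflect where the paste was cut.

```python
import re

RC9 = """\
20.96%  [kernel]    [k] queued_spin_lock_slowpath
 5.51%  [kernel]    [k] ixgbe_poll
 5.49%  [kernel]    [k] ixgbe_xmit_frame_ring
 4.39%  [kernel]    [k] do_raw_spin_lock
 4.29%  [kernel]    [k] sch_direct_xmit
 4.11%  [kernel]    [k] fib_table_lookup
 3.11%  [team_mode_roundrobin]  [k] rr_transmit
 2.71%  [kernel]    [k] __dev_queue_xmit
 2.62%  [kernel]    [k] __ptr_ring_peek
 2.39%  [kernel]    [k] skb_release_data
 2.18%  [kernel]    [k] dev_gro_receive
 1.75%  [kernel]    [k] __qdisc_run
 1.67%  [kernel]    [k] pfifo_fast_enqueue
 1.57%  [kernel]    [k] netdev_pick_tx
 1.56%  [kernel]    [k] page_frag_free
 1.48%  [kernel]    [k] ip_finish_output2
 1.38%  [kernel]    [k] __slab_free
 1.36%  [kernel]    [k] skb_unref
 1.34%  [kernel]    [k] ixgbe_maybe_stop_tx
 1.30%  [kernel]    [k] vlan_do_receive
 1.28%  [kernel]    [k] pfifo_fast_dequeue
 1.23%  [kernel]    [k] virt_to_head_page
"""

RC3 = """\
 7.81%  [kernel]    [k] ixgbe_xmit_frame_ring
 7.61%  [kernel]    [k] ixgbe_poll
 7.09%  [kernel]    [k] do_raw_spin_lock
 5.63%  [kernel]    [k] fib_table_lookup
 5.19%  [kernel]    [k] __dev_queue_xmit
 4.38%  [team_mode_roundrobin]  [k] rr_transmit
 3.10%  [kernel]    [k] netdev_pick_tx
 2.79%  [kernel]    [k] skb_release_data
 2.34%  [kernel]    [k] dev_gro_receive
 1.99%  [kernel]    [k] page_frag_free
 1.96%  [kernel]    [k] skb_unref
 1.92%  [kernel]    [k] virt_to_head_page
 1.90%  [kernel]    [k] ixgbe_maybe_stop_tx
"""

# Matches lines such as " 5.51%  [kernel]    [k] ixgbe_poll".
LINE = re.compile(r"^\s*([\d.]+)%\s+\[([^\]]+)\]\s+\[k\]\s+(\S+)\s*$")


def parse_perf_top(text):
    """Return {symbol: percent} for every line that looks like perf top output."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            result[match.group(3)] = float(match.group(1))
    return result


def only_in_first(first, second):
    """Symbols present in `first` but not `second`, ignoring entries that fall
    below the smallest percentage listed in `second` (its cut-off point)."""
    floor = min(second.values())
    return {sym: pct for sym, pct in first.items()
            if sym not in second and pct >= floor}


if __name__ == "__main__":
    new_symbols = only_in_first(parse_perf_top(RC9), parse_perf_top(RC3))
    for sym, pct in sorted(new_symbols.items(), key=lambda kv: -kv[1]):
        print(f"{pct:6.2f}%  {sym}")
```

With the data above, this reports queued_spin_lock_slowpath, sch_direct_xmit and __ptr_ring_peek as the rc9-only entries above the cut-off.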
 

kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps

2018-01-27 Thread Paweł Staszewski

Hi


Today I ran some real-life traffic tests with kernel 4.15.0-rc9,

but when traffic reaches 50Gbit/s and about 6Mpps, CPU load rises quickly
from 48% to 100% on all CPU cores.


Here is a graph showing how CPU load rises as the packet rate
increases.



https://ibb.co/mhD5ob
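
For scale, the two figures quoted above (50Gbit/s at about 6Mpps) imply a fairly large average packet. A minimal Python sketch of the arithmetic follows. It assumes both readings were taken at the same measurement point and at the same moment, which the message does not state; the function name is only illustrative.

```python
def avg_packet_bytes(bits_per_second: float, packets_per_second: float) -> float:
    """Average bytes per packet implied by a bit rate and a packet rate."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return bits_per_second / 8 / packets_per_second


if __name__ == "__main__":
    # 50 Gbit/s and 6 Mpps, as reported in the message.
    print(f"{avg_packet_bytes(50e9, 6e6):.0f} bytes per packet")
```

Under that assumption the average works out to roughly 1 KB per packet.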


here is perf record from that time:

https://pastebin.com/3zqG1rvE


There are 8x 10G ixgbe 82599 interfaces teamed with teamd.

No traffic queueing - only pfifo_fast on all interfaces.

No NAT and no iptables rules other than INPUT (about 30 rules)

All NICs have the same ethtool settings:

ethtool -k eth0
Features for eth0:
Cannot get device udp-fragmentation-offload settings: Operation not 
supported

rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: on
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on


ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini:    0
RX Jumbo:   0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini:    0
RX Jumbo:   0
TX: 2048


ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 512
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0