Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-10-18 23:54, Eric Dumazet wrote:
> On Wed, 2017-10-18 at 23:49 +0200, Paweł Staszewski wrote:
>> How far is it from being applied to the kernel?
>>
>> So far I'm using this on all my servers for about 3 months now without
>> problems.
>
> It is a hack, and does not properly support bonding/team.
> (If the real_dev->priv_flags IFF_XMIT_DST_RELEASE bit changes, we want
> to update all the vlans at the same time.)
>
> We need something more sophisticated, and I had no time to spend on
> this topic recently.

ok
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Wed, 2017-10-18 at 23:49 +0200, Paweł Staszewski wrote:
> How far is it from being applied to the kernel?
>
> So far I'm using this on all my servers for about 3 months now without
> problems.

It is a hack, and does not properly support bonding/team.
(If the real_dev->priv_flags IFF_XMIT_DST_RELEASE bit changes, we want to
update all the vlans at the same time.)

We need something more sophisticated, and I had no time to spend on this
topic recently.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 23:41, Florian Fainelli wrote:
> On 09/21/2017 02:26 PM, Paweł Staszewski wrote:
>> On 2017-08-15 11:11, Paweł Staszewski wrote:
>>> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
>>> index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
>>> --- a/net/8021q/vlan_netlink.c
>>> +++ b/net/8021q/vlan_netlink.c
>>> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
>>>  	vlan->vlan_proto = proto;
>>>  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>>>  	vlan->real_dev = real_dev;
>>> +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>>>  	vlan->flags = VLAN_FLAG_REORDER_HDR;
>>>  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
>>
>> Any plans for this patch to go into the kernel normally?
>
> Would not this apply to pretty much any stacked device setup though? It
> seems like any network device that just queues up its packet on another
> physical device for actual transmission may need that (e.g.: DSA, bond,
> team, more?)

How far is it from being applied to the kernel?

So far I'm using this on all my servers for about 3 months now without
problems.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Thu, 2017-09-21 at 15:07 -0700, Florian Fainelli wrote:
> On 09/21/2017 02:54 PM, Eric Dumazet wrote:
> > On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:
> >
> >> Would not this apply to pretty much any stacked device setup though? It
> >> seems like any network device that just queues up its packet on another
> >> physical device for actual transmission may need that (e.g.: DSA, bond,
> >> team, more?)
> >
> > We support bonding and team already.
>
> Right, so that seems to mostly leave us with DSA at least. What about
> other devices that also have IFF_NO_QUEUE set?

It won't work.

loopback has IFF_NO_QUEUE, but you need to keep dst on skbs...
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 09/21/2017 02:54 PM, Eric Dumazet wrote:
> On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:
>
>> Would not this apply to pretty much any stacked device setup though? It
>> seems like any network device that just queues up its packet on another
>> physical device for actual transmission may need that (e.g.: DSA, bond,
>> team, more?)
>
> We support bonding and team already.

Right, so that seems to mostly leave us with DSA at least. What about
other devices that also have IFF_NO_QUEUE set?
--
Florian
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:
> Would not this apply to pretty much any stacked device setup though? It
> seems like any network device that just queues up its packet on another
> physical device for actual transmission may need that (e.g.: DSA, bond,
> team, more?)

We support bonding and team already.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 23:41, Florian Fainelli wrote:
> On 09/21/2017 02:26 PM, Paweł Staszewski wrote:
>> On 2017-08-15 11:11, Paweł Staszewski wrote:
>>> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
>>> index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
>>> --- a/net/8021q/vlan_netlink.c
>>> +++ b/net/8021q/vlan_netlink.c
>>> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
>>>  	vlan->vlan_proto = proto;
>>>  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>>>  	vlan->real_dev = real_dev;
>>> +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>>>  	vlan->flags = VLAN_FLAG_REORDER_HDR;
>>>  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
>>
>> Any plans for this patch to go into the kernel normally?
>
> Would not this apply to pretty much any stacked device setup though? It
> seems like any network device that just queues up its packet on another
> physical device for actual transmission may need that (e.g.: DSA, bond,
> team, more?)

Some devices like bond have it. Maybe vlans were just not taken into
account when the first patch went in. I did not check all of them :)
But I know Eric will :)
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 09/21/2017 02:26 PM, Paweł Staszewski wrote:
>
> On 2017-08-15 11:11, Paweł Staszewski wrote:
>> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
>> index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
>> --- a/net/8021q/vlan_netlink.c
>> +++ b/net/8021q/vlan_netlink.c
>> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
>>  	vlan->vlan_proto = proto;
>>  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>>  	vlan->real_dev = real_dev;
>> +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>>  	vlan->flags = VLAN_FLAG_REORDER_HDR;
>>  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
>
> Any plans for this patch to go into the kernel normally?

Would not this apply to pretty much any stacked device setup though? It
seems like any network device that just queues up its packet on another
physical device for actual transmission may need that (e.g.: DSA, bond,
team, more?)
--
Florian
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 23:34, Eric Dumazet wrote:
> On Thu, 2017-09-21 at 23:26 +0200, Paweł Staszewski wrote:
>> On 2017-08-15 11:11, Paweł Staszewski wrote:
>>> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
>>> index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
>>> --- a/net/8021q/vlan_netlink.c
>>> +++ b/net/8021q/vlan_netlink.c
>>> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
>>>  	vlan->vlan_proto = proto;
>>>  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>>>  	vlan->real_dev = real_dev;
>>> +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>>>  	vlan->flags = VLAN_FLAG_REORDER_HDR;
>>>  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
>>
>> Any plans for this patch to go into the kernel normally?
>>
>> So far I'm using it for about 3 weeks on all my Linux-based routers,
>> and no problems.
>
> Yes, I was about to submit it, as I mentioned to you a few hours ago ;)

Yes, I saw your point 2) in the previous emails :) But there was no patch
for it in the previous reply, so I was thinking that maybe there were too
many things to do and you forgot about it :)

Thanks
Paweł
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Thu, 2017-09-21 at 23:26 +0200, Paweł Staszewski wrote:
>
> On 2017-08-15 11:11, Paweł Staszewski wrote:
> > diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> > index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
> > --- a/net/8021q/vlan_netlink.c
> > +++ b/net/8021q/vlan_netlink.c
> > @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
> >  	vlan->vlan_proto = proto;
> >  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
> >  	vlan->real_dev = real_dev;
> > +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
> >  	vlan->flags = VLAN_FLAG_REORDER_HDR;
> >  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
>
> Any plans for this patch to go into the kernel normally?
>
> So far I'm using it for about 3 weeks on all my Linux-based routers, and
> no problems.

Yes, I was about to submit it, as I mentioned to you a few hours ago ;)
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-08-15 11:11, Paweł Staszewski wrote:
> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
> --- a/net/8021q/vlan_netlink.c
> +++ b/net/8021q/vlan_netlink.c
> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
>  	vlan->vlan_proto = proto;
>  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>  	vlan->real_dev = real_dev;
> +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>  	vlan->flags = VLAN_FLAG_REORDER_HDR;
>  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);

Any plans for this patch to go into the kernel normally?

So far I'm using it for about 3 weeks on all my Linux-based routers, and
no problems.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Another test for this patch, with the linux-next tree plus the patch:

bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate

  iface       Rx                Tx                Total
  ==========================================================
  vlan1004:   1.00 P/s          606842.31 P/s     606843.31 P/s
  lo:         0.00 P/s          0.00 P/s          0.00 P/s
  vlan1016:   0.00 P/s          607730.56 P/s     607730.56 P/s
  vlan1020:   0.00 P/s          606891.25 P/s     606891.25 P/s
  vlan1018:   0.00 P/s          607580.88 P/s     607580.88 P/s
  vlan1014:   0.00 P/s          607606.81 P/s     607606.81 P/s
  vlan1005:   0.00 P/s          606788.44 P/s     606788.44 P/s
  enp2s0f0:   2.00 P/s          2.00 P/s          3.99 P/s
  vlan1017:   0.00 P/s          607643.75 P/s     607643.75 P/s
  enp132s0:   13079658.00 P/s   0.00 P/s          13079658.00 P/s
  vlan1000:   0.00 P/s          604409.19 P/s     604409.19 P/s
  vlan1010:   0.00 P/s          606984.06 P/s     606984.06 P/s
  vlan1019:   0.00 P/s          607452.12 P/s     607452.12 P/s
  vlan1008:   0.00 P/s          606803.44 P/s     606803.44 P/s
  vlan1011:   0.00 P/s          607048.94 P/s     607048.94 P/s
  vlan1001:   0.00 P/s          606773.50 P/s     606773.50 P/s
  vlan1006:   0.00 P/s          606811.38 P/s     606811.38 P/s
  vlan1012:   0.00 P/s          607051.94 P/s     607051.94 P/s
  vlan1013:   0.00 P/s          607067.88 P/s     607067.88 P/s
  enp4s0:     2.00 P/s          13020803.00 P/s   13020805.00 P/s
  vlan1007:   0.00 P/s          606798.44 P/s     606798.44 P/s
  vlan1002:   0.00 P/s          606840.31 P/s     606840.31 P/s
  vlan1009:   0.00 P/s          606809.38 P/s     606809.38 P/s
  enp2s0f1:   100.80 P/s        0.00 P/s          100.80 P/s
  vlan1015:   0.00 P/s          607089.81 P/s     607089.81 P/s
  vlan1003:   1.00 P/s          606928.19 P/s     606929.19 P/s
  ----------------------------------------------------------
  total:      13079765.00 P/s   25766758.00 P/s   38846524.00 P/s

13 Mpps forwarded (32 cores active for the two mlx5 NICs), 80% CPU load
(20% idle on all cores).

PerfTop: 126552 irqs/sec kernel:99.3% exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
---
   8.25% [kernel] [k] fib_table_lookup
   7.98% [kernel] [k] do_raw_spin_lock
   6.20% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
   4.21% [kernel] [k] mlx5e_xmit
   3.37% [kernel] [k] __dev_queue_xmit
   2.95% [kernel] [k] ip_rcv
   2.72% [kernel] [k] ipt_do_table
   2.24% [kernel] [k] ip_finish_output2
   2.22% [kernel] [k] __netif_receive_skb_core
   2.17% [kernel] [k] ip_forward
   2.15% [kernel] [k] __build_skb
   1.99% [kernel] [k] ip_route_input_rcu
   1.70% [kernel] [k] mlx5e_txwqe_complete
   1.54% [kernel] [k] dev_gro_receive
   1.45% [kernel] [k] mlx5_cqwq_get_cqe
   1.38% [kernel] [k] udp_v4_early_demux
   1.35% [kernel] [k] netif_skb_features
   1.33% [kernel] [k] inet_gro_receive
   1.29% [kernel] [k] dev_hard_start_xmit
   1.27% [kernel] [k] ip_rcv_finish
   1.19% [kernel] [k] mlx5e_build_rx_skb
   1.15% [kernel] [k] __netdev_pick_tx
   1.11% [kernel] [k] kmem_cache_alloc
   1.09% [kernel] [k] mlx5e_poll_tx_cq
   1.07% [kernel] [k] mlx5e_txwqe_build_dsegs
   1.00% [kernel] [k] vlan_dev_hard_start_xmit
   0.90% [kernel] [k] __napi_alloc_skb
   0.87% [kernel] [k] validate_xmit_skb
   0.87% [kernel] [k] read_tsc
   0.83% [kernel] [k] napi_gro_receive
   0.79% [kernel] [k] skb_network_protocol
   0.79% [kernel] [k] sch_direct_xmit
   0.78% [kernel] [k] __local_bh_enable_ip
   0.78% [kernel] [k] netdev_pick_tx
   0.75% [kernel] [k] __udp4_lib_lookup
   0.72% [kernel] [k] netif_receive_skb_internal
   0.71% [kernel] [k] page_frag_free
   0.71% [kernel] [k] deliver_ptype_list_skb
   0.70% [kernel] [k] fib_validate_source
   0.69% [kernel] [k] mlx5_cqwq_get_cqe
   0.69% [kernel] [k] __netif_receive_skb
   0.68% [kernel] [k] vlan_passthru_hard_header
   0.61% [kernel] [k] rt_cache_valid
   0.59% [kernel] [k] iptable_filter_hook

Without patch: 12.7 Mpps forwarded (32cor
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Tested with ConnectX-5.

Without patch: 10 Mpps -> 16 cores used

PerfTop: 66258 irqs/sec kernel:99.3% exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
---
  10.12% [kernel] [k] do_raw_spin_lock
   6.31% [kernel] [k] fib_table_lookup
   6.12% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
   4.90% [kernel] [k] rt_cache_valid
   3.99% [kernel] [k] mlx5e_xmit
   3.03% [kernel] [k] ip_rcv
   2.68% [kernel] [k] __netif_receive_skb_core
   2.54% [kernel] [k] skb_dst_force
   2.41% [kernel] [k] ip_finish_output2
   2.21% [kernel] [k] __build_skb
   2.03% [kernel] [k] __dev_queue_xmit
   1.96% [kernel] [k] mlx5e_txwqe_complete
   1.79% [kernel] [k] ipt_do_table
   1.78% [kernel] [k] inet_gro_receive
   1.69% [kernel] [k] ip_forward
   1.66% [kernel] [k] udp_v4_early_demux
   1.65% [kernel] [k] dst_release
   1.56% [kernel] [k] ip_rcv_finish
   1.45% [kernel] [k] dev_gro_receive
   1.45% [kernel] [k] netif_skb_features
   1.39% [kernel] [k] mlx5e_poll_tx_cq
   1.35% [kernel] [k] mlx5e_txwqe_build_dsegs
   1.35% [kernel] [k] ip_route_input_rcu
   1.15% [kernel] [k] dev_hard_start_xmit
   1.12% [kernel] [k] napi_gro_receive
   1.07% [kernel] [k] netif_receive_skb_internal
   0.98% [kernel] [k] sch_direct_xmit
   0.95% [kernel] [k] kmem_cache_alloc
   0.89% [kernel] [k] read_tsc
   0.88% [kernel] [k] mlx5e_build_rx_skb
   0.86% [kernel] [k] mlx5_cqwq_get_cqe
   0.82% [kernel] [k] page_frag_free
   0.78% [kernel] [k] __local_bh_enable_ip
   0.69% [kernel] [k] skb_network_protocol
   0.68% [kernel] [k] __netif_receive_skb
   0.67% [kernel] [k] vlan_dev_hard_start_xmit
   0.65% [kernel] [k] mlx5e_poll_rx_cq
   0.65% [kernel] [k] validate_xmit_skb
   0.60% [kernel] [k] eth_type_trans
   0.60% [kernel] [k] deliver_ptype_list_skb
   0.60% [kernel] [k] fib_validate_source
   0.55% [kernel] [k] eth_header
   0.53% [kernel] [k] netdev_pick_tx
   0.53% [kernel] [k] __napi_alloc_skb
   0.51% [kernel] [k] __udp4_lib_lookup
   0.50% [kernel] [k] eth_type_vlan
   0.49% [kernel] [k] ip_output
   0.49% [kernel] [k] page_frag_alloc
   0.49% [kernel] [k] ip_finish_output
   0.48% [kernel] [k] neigh_connected_output
   0.45% [kernel] [k] nf_hook_slow
   0.44% [kernel] [k] udp4_gro_receive
   0.39% [kernel] [k] mlx5e_features_check
   0.39% [kernel] [k] mlx5e_napi_poll
   0.37% [kernel] [k] __jhash_nwords
   0.37% [kernel] [k] udp_gro_receive
   0.36% [kernel] [k] swiotlb_map_page
   0.33% [kernel] [k] mlx5_cqwq_get_wqe
   0.33% [kernel] [k] __netdev_pick_tx
   0.29% [kernel] [k] ktime_get_with_offset
   0.29% [kernel] [k] get_dma_ops
   0.29% [kernel] [k] validate_xmit_skb_list
   0.26% [kernel] [k] vlan_passthru_hard_header
   0.26% [kernel] [k] __udp4_lib_lookup_skb
   0.24% [kernel] [k] get_dma_ops
   0.24% [kernel] [k] skb_release_data
   0.23% [kernel] [k] ip_forward_finish
   0.23% [kernel] [k] kmem_cache_free_bulk
   0.23% [kernel] [k] timekeeping_get_ns
   0.22% [kernel] [k] ip_skb_dst_mtu
   0.21% [kernel] [k] compound_head
   0.20% [kernel] [k] skb_gro_reset_offset
   0.20% [kernel] [k] is_swiotlb_buffer
   0.19% [kernel] [k] __net_timestamp.isra.90
   0.19% [kernel] [k] dst_metric.constprop.61
   0.18% [kernel] [k] skb_orphan_frags.constprop.126
   0.18% [kernel] [k] _kfree_skb_defer
   0.18% [kernel] [k] irq_entries_start
   0.17% [kernel] [k] dev_hard_header.constprop.54
   0.17% [kernel] [k] dma_mapping_error
   0.17% [kernel] [k] neigh_resolve_output

With patch: 12 Mpps -> 16 cores

PerfTop: 66209 irqs/sec kernel:99.3% exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
---
  10.67% [kernel] [k] do_raw_spin_lock
   6.96% [kernel] [k] fib_table_lookup
   6.53% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
   4.17% [kernel] [k] mlx5e_xmit
   3.22% [kernel] [k] ip_rcv
   3.07% [kernel] [k] __netif_receive_skb_core
   2.86% [kernel] [k] __dev_queu
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Hi

Are there any plans to have this fix in the kernel normally? Or is it
mostly only a hack, not a long-term fix, and needs to be done differently?

All the tests that were done show that without this patch there is about
20-30% network forwarding performance degradation when using vlan
interfaces.

Thanks
Paweł

On 2017-08-15 03:17, Eric Dumazet wrote:
> On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:
>
>> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.
>
> Something like :
>
> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
> --- a/net/8021q/vlan_netlink.c
> +++ b/net/8021q/vlan_netlink.c
> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
>  	vlan->vlan_proto = proto;
>  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>  	vlan->real_dev = real_dev;
> +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>  	vlan->flags = VLAN_FLAG_REORDER_HDR;
>
>  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Tue, 15 Aug 2017 12:05:37 +0200 Paweł Staszewski wrote:

> On 2017-08-15 12:02, Paweł Staszewski wrote:
> > On 2017-08-15 11:57, Jesper Dangaard Brouer wrote:
> >> On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski wrote:
> >>> On 2017-08-15 11:23, Jesper Dangaard Brouer wrote:
> >>>> On Tue, 15 Aug 2017 02:38:56 +0200 Paweł Staszewski wrote:
> >>>>> On 2017-08-14 18:19, Jesper Dangaard Brouer wrote:
> >>>>>> On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:
[... cut ...]
> >>> Ethtool(enp175s0f1) stat: 8895566 ( 8,895,566) <= tx_prio0_packets /sec
> >>> Ethtool(enp175s0f1) stat: 640470657 (640,470,657) <= tx_vport_unicast_bytes /sec
> >>> Ethtool(enp175s0f1) stat: 8895427 ( 8,895,427) <= tx_vport_unicast_packets /sec
> >>> Ethtool(enp175s0f1) stat: 498 (498) <= tx_xmit_more /sec
> >>
> >> We are seeing some xmit_more, this is interesting. Have you noticed,
> >> if (in the VLAN case) there is a queue in the qdisc layer?
> >>
> >> Simply inspect with: tc -s qdisc show dev ixgbe2
[...]
> > physical interface mq attached with pfifo_fast:
> >
> > tc -s -d qdisc show dev enp175s0f1
> > qdisc mq 0: root
> > Sent 1397200697212 bytes 3965888669 pkt (dropped 78065663, overlimits 0 requeues 629868)
> > backlog 0b 0p requeues 629868
> > qdisc pfifo_fast 0: parent :38 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> > Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> > backlog 0b 0p requeues 0
> > qdisc pfifo_fast 0: parent :37 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> > Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> > backlog 0b 0p requeues 0
[...]

So, it doesn't look like there is any backlog queue. Although, this can
be difficult to measure/see this way (as the kernel empties the queue
quickly via bulk dequeue), also given the small amount of xmit_more,
which indicates that the queue was likely very small.

There is a "dropped" counter, which indicates that you likely had a
setup (earlier) where you managed to overflow the qdisc queues.

> just saw that after changing RSS on the NICs I didn't delete the qdisc
> and add it again.
> Here is the situation with qdisc del / add:
> tc -s -d qdisc show dev enp175s0f1
> qdisc mq 1: root
> Sent 43738523966 bytes 683414438 pkt (dropped 0, overlimits 0 requeues 1886)
> backlog 0b 0p requeues 1886
> qdisc pfifo_fast 0: parent 1:10 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> Sent 2585011904 bytes 40390811 pkt (dropped 0, overlimits 0 requeues 110)
> backlog 0b 0p requeues 110
> qdisc pfifo_fast 0: parent 1:f bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> Sent 2602068416 bytes 40657319 pkt (dropped 0, overlimits 0 requeues 121)
> backlog 0b 0p requeues 121
[...]

Exactly as you indicated above, these "dropped" stats came from another
(earlier) test case. (Great that you caught this yourself.)

While trying to reproduce your case, I also managed to cause a situation
with qdisc overload. This caused some weird behavior, where I saw
RX=8Mpps and TX only 4Mpps. (I didn't figure out the exact tuning that
caused this, and cannot reproduce it now.)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-08-15 12:02, Paweł Staszewski wrote:
> On 2017-08-15 11:57, Jesper Dangaard Brouer wrote:
> > On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski wrote:
> > > On 2017-08-15 11:23, Jesper Dangaard Brouer wrote:
> > > > On Tue, 15 Aug 2017 02:38:56 +0200 Paweł Staszewski wrote:
> > > > > On 2017-08-14 18:19, Jesper Dangaard Brouer wrote:
> > > > > > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:
> > > > > > > To show some difference, below a comparison of vlan/no-vlan
> > > > > > > traffic: 10Mpps forwarded traffic with no-vlan vs 6.9Mpps
> > > > > > > with vlan
> > > > > > I'm trying to reproduce in my testlab (with ixgbe). I do see a
> > > > > > performance reduction of about 10-19% when I forward out a VLAN
> > > > > > interface. This is larger than I expected, but still lower than
> > > > > > the 30-40% slowdown you reported.
> > > > > > [...]
> > > Ok, Mellanox arrived (MT27700 - mlx5 driver).
> > > And to compare Mellanox with vlans and without: 33% performance
> > > degradation (less than with ixgbe, where I reach ~40% with the same
> > > settings).
> > >
> > > Mellanox without TX traffic on vlan:
> > > ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> > > 0;16;64;11089305;709715520;8871553;567779392
> > > 1;16;64;11096292;710162688;11095566;710116224
> > > 2;16;64;11095770;710129280;11096799;710195136
> > > 3;16;64;11097199;710220736;11097702;710252928
> > > 4;16;64;11080984;567081856;11079662;709098368
> > > 5;16;64;11077696;708972544;11077039;708930496
> > > 6;16;64;11082991;709311424;8864802;567347328
> > > 7;16;64;11089596;709734144;8870927;709789184
> > > 8;16;64;11094043;710018752;11095391;710105024
> > >
> > > Mellanox with TX traffic on vlan:
> > > ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> > > 0;16;64;7369914;471674496;7370281;471697980
> > > 1;16;64;7368896;471609408;7368043;471554752
> > > 2;16;64;7367577;471524864;7367759;471536576
> > > 3;16;64;7368744;377305344;7369391;471641024
> > > 4;16;64;7366824;471476736;7364330;471237120
> > > 5;16;64;7368352;471574528;7367239;471503296
> > > 6;16;64;7367459;471517376;7367806;471539584
> > > 7;16;64;7367190;471500160;7367988;471551232
> > > 8;16;64;7368023;471553472;7368076;471556864
> >
> > I wonder if the driver's page recycler is active/working or not, and
> > if the situation is different between VLAN vs no-vlan (given
> > page_frag_free is so high in your perf top). The Mellanox drivers
> > fortunately have a stats counter to tell us this explicitly (which
> > the ixgbe driver doesn't).
> >
> > You can use my ethtool_stats.pl script to watch these stats:
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> > (Hint perl dependency: dnf install perl-Time-HiRes)
>
> For RX NIC:
>
> Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
> Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
> Ethtool(enp175s0f0) stat: 230978 (230,978) <= rx0_cache_reuse /sec
> Ethtool(enp175s0f0) stat: 1152648 ( 1,152,648) <= rx0_csum_complete /sec
> Ethtool(enp175s0f0) stat: 1152648 ( 1,152,648) <= rx0_packets /sec
> Ethtool(enp175s0f0) stat: 921614 (921,614) <= rx0_page_reuse /sec
> Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
> Ethtool(enp175s0f0) stat: 233343 (233,343) <= rx1_cache_reuse /sec
> Ethtool(enp175s0f0) stat: 1161126 ( 1,161,126) <= rx1_csum_complete /sec
> Ethtool(enp175s0f0) stat: 1161126 ( 1,161,126) <= rx1_packets /sec
> Ethtool(enp175s0f0) stat: 927793 (927,793) <= rx1_page_reuse /sec
> Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
> Ethtool(enp175s0f0) stat: 233735 (233,735) <= rx2_cache_reuse /sec
> Ethtool(enp175s0f0) stat: 1171722 ( 1,171,722) <= rx2_csum_complete /sec
> Ethtool(enp175s0f0) stat: 1171722 ( 1,171,722) <= rx2_packets /sec
> Ethtool(enp175s0f0) stat: 937989 (937,989) <= rx2_page_reuse /sec
> Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
> Ethtool(enp175s0f0) stat: 230311 (230,311) <= rx3_cache_reuse /sec
> Ethtool(enp175s0f0) stat: 1152837 ( 1,152,837) <= rx3_csum_complete /sec
> Ethtool(enp175s0f0) stat: 1152837 ( 1,152,837) <= rx3_packets /sec
> Ethtool(enp175s0f0) stat: 922513 (922,513) <= rx3_page_reuse /sec
> Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
> Ethtool(enp175s0f0) stat: 191969 (191,969) <= rx4_cache_reuse /sec
> Ethtool(enp175s0f0) stat: 958317 (958,317) <= rx4_csum_complete /sec
> Ethtool(enp175s0f0) stat: 958317 (958,317) <= rx4_packets /sec
> Ethtool(enp175s0f0) stat: 766332 (766,332) <= rx4_page_reuse /sec
> Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
> Ethtool(enp175s0f0) stat: 197150 (197,150) <= rx5_cache_reuse /sec
> Ethtool(enp175s0f0) stat: 984128 (984,128) <= rx5_csum_complete /sec
> Ethtool(enp175s0f0) stat: 984128 (984,128) <= rx5_packets /sec
> Ethtool(enp175s0f0) stat: 786978 (786,978) <= rx5_page_reuse /sec
> Ethtool(enp175s0f0) stat: 79076984 ( 79,076,984) <= rx6_b
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Tue, 15 Aug 2017 11:11:57 +0200 Paweł Staszewski wrote:

> Yes, it helped - now there is almost no difference when using vlans or not:
>
> 10.5Mpps - with vlan
>
> 11Mpps - without vlan

Great! - it seems like we have pinpointed the root-cause. It also
demonstrates how big the benefit of Eric's commit is (thanks!):
https://git.kernel.org/torvalds/c/93f154b594fe

> On 2017-08-15 03:17, Eric Dumazet wrote:
> > On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:
> >
> >> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.
> >
> > Something like :
> >
> > diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> > index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
> > --- a/net/8021q/vlan_netlink.c
> > +++ b/net/8021q/vlan_netlink.c
> > @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
> >  	vlan->vlan_proto = proto;
> >  	vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
> >  	vlan->real_dev = real_dev;
> > +	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
> >  	vlan->flags = VLAN_FLAG_REORDER_HDR;
> >
> >  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-08-15 11:57, Jesper Dangaard Brouer wrote:
> On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski wrote:
> > On 2017-08-15 11:23, Jesper Dangaard Brouer wrote:
> > > On Tue, 15 Aug 2017 02:38:56 +0200 Paweł Staszewski wrote:
> > > > On 2017-08-14 18:19, Jesper Dangaard Brouer wrote:
> > > > > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:
> > > > > > To show some difference, below a comparison of vlan/no-vlan
> > > > > > traffic: 10Mpps forwarded traffic with no-vlan vs 6.9Mpps
> > > > > > with vlan
> > > > > I'm trying to reproduce in my testlab (with ixgbe). I do see a
> > > > > performance reduction of about 10-19% when I forward out a VLAN
> > > > > interface. This is larger than I expected, but still lower than
> > > > > the 30-40% slowdown you reported.
> > > > > [...]
> > Ok, Mellanox arrived (MT27700 - mlx5 driver).
> > And to compare Mellanox with vlans and without: 33% performance
> > degradation (less than with ixgbe, where I reach ~40% with the same
> > settings).
> >
> > Mellanox without TX traffic on vlan:
> > ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> > 0;16;64;11089305;709715520;8871553;567779392
> > 1;16;64;11096292;710162688;11095566;710116224
> > 2;16;64;11095770;710129280;11096799;710195136
> > 3;16;64;11097199;710220736;11097702;710252928
> > 4;16;64;11080984;567081856;11079662;709098368
> > 5;16;64;11077696;708972544;11077039;708930496
> > 6;16;64;11082991;709311424;8864802;567347328
> > 7;16;64;11089596;709734144;8870927;709789184
> > 8;16;64;11094043;710018752;11095391;710105024
> >
> > Mellanox with TX traffic on vlan:
> > ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> > 0;16;64;7369914;471674496;7370281;471697980
> > 1;16;64;7368896;471609408;7368043;471554752
> > 2;16;64;7367577;471524864;7367759;471536576
> > 3;16;64;7368744;377305344;7369391;471641024
> > 4;16;64;7366824;471476736;7364330;471237120
> > 5;16;64;7368352;471574528;7367239;471503296
> > 6;16;64;7367459;471517376;7367806;471539584
> > 7;16;64;7367190;471500160;7367988;471551232
> > 8;16;64;7368023;471553472;7368076;471556864
>
> I wonder if the driver's page recycler is active/working or not, and if
> the situation is different between VLAN vs no-vlan (given
> page_frag_free is so high in your perf top). The Mellanox drivers
> fortunately have a stats counter to tell us this explicitly (which the
> ixgbe driver doesn't).
>
> You can use my ethtool_stats.pl script to watch these stats:
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> (Hint perl dependency: dnf install perl-Time-HiRes)

For RX NIC:

Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
Ethtool(enp175s0f0) stat: 230978 (230,978) <= rx0_cache_reuse /sec
Ethtool(enp175s0f0) stat: 1152648 ( 1,152,648) <= rx0_csum_complete /sec
Ethtool(enp175s0f0) stat: 1152648 ( 1,152,648) <= rx0_packets /sec
Ethtool(enp175s0f0) stat: 921614 (921,614) <= rx0_page_reuse /sec
Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
Ethtool(enp175s0f0) stat: 233343 (233,343) <= rx1_cache_reuse /sec
Ethtool(enp175s0f0) stat: 1161126 ( 1,161,126) <= rx1_csum_complete /sec
Ethtool(enp175s0f0) stat: 1161126 ( 1,161,126) <= rx1_packets /sec
Ethtool(enp175s0f0) stat: 927793 (927,793) <= rx1_page_reuse /sec
Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
Ethtool(enp175s0f0) stat: 233735 (233,735) <= rx2_cache_reuse /sec
Ethtool(enp175s0f0) stat: 1171722 ( 1,171,722) <= rx2_csum_complete /sec
Ethtool(enp175s0f0) stat: 1171722 ( 1,171,722) <= rx2_packets /sec
Ethtool(enp175s0f0) stat: 937989 (937,989) <= rx2_page_reuse /sec
Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
Ethtool(enp175s0f0) stat: 230311 (230,311) <= rx3_cache_reuse /sec
Ethtool(enp175s0f0) stat: 1152837 ( 1,152,837) <= rx3_csum_complete /sec
Ethtool(enp175s0f0) stat: 1152837 ( 1,152,837) <= rx3_packets /sec
Ethtool(enp175s0f0) stat: 922513 (922,513) <= rx3_page_reuse /sec
Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
Ethtool(enp175s0f0) stat: 191969 (191,969) <= rx4_cache_reuse /sec
Ethtool(enp175s0f0) stat: 958317 (958,317) <= rx4_csum_complete /sec
Ethtool(enp175s0f0) stat: 958317 (958,317) <= rx4_packets /sec
Ethtool(enp175s0f0) stat: 766332 (766,332) <= rx4_page_reuse /sec
Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
Ethtool(enp175s0f0) stat: 197150 (197,150) <= rx5_cache_reuse /sec
Ethtool(enp175s0f0) stat: 984128 (984,128) <= rx5_csum_complete /sec
Ethtool(enp175s0f0) stat: 984128 (984,128) <= rx5_packets /sec
Ethtool(enp175s0f0) stat: 786978 (786,978) <= rx5_page_reuse /sec
Ethtool(enp175s0f0) stat: 79076984 ( 79,076,984) <= rx6_bytes /sec
Ethtool(enp175s0f0) stat: 233735 (233,735) <=
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski wrote:

> W dniu 2017-08-15 o 11:23, Jesper Dangaard Brouer pisze:
> > On Tue, 15 Aug 2017 02:38:56 +0200 Paweł Staszewski wrote:
> >
> >> W dniu 2017-08-14 o 18:19, Jesper Dangaard Brouer pisze:
> >>> On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:
> >>>
> >>>> To show some difference below comparison vlan/no-vlan traffic
> >>>>
> >>>> 10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan
> >>> I'm trying to reproduce in my testlab (with ixgbe). I do see a
> >>> performance reduction of about 10-19% when I forward out a VLAN
> >>> interface. This is larger than I expected, but still lower than the
> >>> 30-40% slowdown you reported.
> >>>
> >>> [...]
> >> Ok mellanox arrived (MT27700 - mlx5 driver)
> >> And to compare Mellanox with vlans and without: 33% performance
> >> degradation (less than with ixgbe where i reach ~40% with same settings)
> >>
> >> Mellanox without TX traffic on vlan:
> >> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> >> 0;16;64;11089305;709715520;8871553;567779392
> >> 1;16;64;11096292;710162688;11095566;710116224
> >> 2;16;64;11095770;710129280;11096799;710195136
> >> 3;16;64;11097199;710220736;11097702;710252928
> >> 4;16;64;11080984;567081856;11079662;709098368
> >> 5;16;64;11077696;708972544;11077039;708930496
> >> 6;16;64;11082991;709311424;8864802;567347328
> >> 7;16;64;11089596;709734144;8870927;709789184
> >> 8;16;64;11094043;710018752;11095391;710105024
> >>
> >> Mellanox with TX traffic on vlan:
> >> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> >> 0;16;64;7369914;471674496;7370281;471697980
> >> 1;16;64;7368896;471609408;7368043;471554752
> >> 2;16;64;7367577;471524864;7367759;471536576
> >> 3;16;64;7368744;377305344;7369391;471641024
> >> 4;16;64;7366824;471476736;7364330;471237120
> >> 5;16;64;7368352;471574528;7367239;471503296
> >> 6;16;64;7367459;471517376;7367806;471539584
> >> 7;16;64;7367190;471500160;7367988;471551232
> >> 8;16;64;7368023;471553472;7368076;471556864
> > I wonder if the drivers page recycler is active/working or not, and if
> > the situation is different between VLAN vs no-vlan (given
> > page_frag_free is so high in your perf top). The Mellanox drivers
> > fortunately have a stats counter to tell us this explicitly (which the
> > ixgbe driver doesn't).
> >
> > You can use my ethtool_stats.pl script to watch these stats:
> >
> >  https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >  (Hint perl dependency: dnf install perl-Time-HiRes)
> For RX NIC:
> Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
> Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
> Ethtool(enp175s0f0) stat:   230978 (    230,978) <= rx0_cache_reuse /sec
> Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_csum_complete /sec
> Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_packets /sec
> Ethtool(enp175s0f0) stat:   921614 (    921,614) <= rx0_page_reuse /sec
> Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
> Ethtool(enp175s0f0) stat:   233343 (    233,343) <= rx1_cache_reuse /sec
> Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_csum_complete /sec
> Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_packets /sec
> Ethtool(enp175s0f0) stat:   927793 (    927,793) <= rx1_page_reuse /sec
> Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
> Ethtool(enp175s0f0) stat:   233735 (    233,735) <= rx2_cache_reuse /sec
> Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_csum_complete /sec
> Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_packets /sec
> Ethtool(enp175s0f0) stat:   937989 (    937,989) <= rx2_page_reuse /sec
> Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
> Ethtool(enp175s0f0) stat:   230311 (    230,311) <= rx3_cache_reuse /sec
> Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_csum_complete /sec
> Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_packets /sec
> Ethtool(enp175s0f0) stat:   922513 (    922,513) <= rx3_page_reuse /sec
> Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
> Ethtool(enp175s0f0) stat:   191969 (    191,969) <= rx4_cache_reuse /sec
> Ethtool(enp175s0f0) stat:   958317 (    958,317) <= rx4_csum_complete /sec
> Ethtool(enp175s0f0) stat:   958317 (    958,317) <= rx4_packets /sec
> Ethtool(enp175s0f0) stat:   766332 (    766,332) <= rx4_page_reuse /sec
> Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
> Ethtool(enp175s0f0) stat:   197150 (    197,150) <= rx5_cache_reuse /sec
> Ethtool(enp175s0f0) stat:   984128 (    984,128) <= rx5_csum_c
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Mon, 14 Aug 2017 18:57:50 +0200 Paolo Abeni wrote:

> On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> > The output (extracted below) didn't show who called 'do_raw_spin_lock',
> > BUT it showed another interesting thing. The kernel code
> > __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> > as it will first call skb_dst_force() and then skb_dst_drop() when the
> > packet is transmitted on a VLAN.
> >
> > static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> > {
> > [...]
> > 	/* If device/qdisc don't need skb->dst, release it right now while
> > 	 * its hot in this cpu cache.
> > 	 */
> > 	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> > 		skb_dst_drop(skb);
> > 	else
> > 		skb_dst_force(skb);
>
> I think that the high impact of the above code in this specific test is
> mostly due to the following:
>
> - ingress packets with different RSS rx hash land on different CPUs
> - but they use the same dst entry, since the destination IPs belong to
>   the same subnet
> - the dst refcnt cacheline is contended between all the CPUs

Good point and explanation Paolo :-)

I changed my pktgen setup to be closer to Pawel's to provoke this
situation some more, and I get closer to provoking it, although not as
clearly as Pawel.

A perf diff does show that the overhead in the VLAN case originates from
the routing "dst_release" code. Diff Baseline==non-vlan case.

[jbrouer@canyon ~]$ sudo ~/perf diff
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  ..............................
#
     3.23%     +4.32%  [kernel.vmlinux]  [k] __dev_queue_xmit
               +3.43%  [kernel.vmlinux]  [k] dst_release
    13.54%     -3.17%  [kernel.vmlinux]  [k] fib_table_lookup
     9.33%     -2.73%  [kernel.vmlinux]  [k] _raw_spin_lock
     7.91%     -1.75%  [ixgbe]           [k] ixgbe_poll
               +1.64%  [8021q]           [k] vlan_dev_hard_start_xmit
     7.23%     -1.26%  [ixgbe]           [k] ixgbe_xmit_frame_ring
     3.34%     -1.10%  [kernel.vmlinux]  [k] eth_type_trans
     5.20%     +0.97%  [kernel.vmlinux]  [k] ip_route_input_rcu
     1.13%     +0.95%  [kernel.vmlinux]  [k] ip_rcv_finish
     2.49%     -0.82%  [kernel.vmlinux]  [k] ip_forward
     3.05%     -0.80%  [kernel.vmlinux]  [k] __build_skb
     0.44%     +0.74%  [kernel.vmlinux]  [k] __netif_receive_skb
               +0.71%  [kernel.vmlinux]  [k] neigh_connected_output
     1.70%     +0.68%  [kernel.vmlinux]  [k] validate_xmit_skb
     1.42%     +0.67%  [kernel.vmlinux]  [k] dev_hard_start_xmit
     0.49%     +0.66%  [kernel.vmlinux]  [k] netif_receive_skb_internal
               +0.62%  [kernel.vmlinux]  [k] eth_header
               +0.57%  [ixgbe]           [k] ixgbe_tx_ctxtdesc
     1.19%     -0.55%  [kernel.vmlinux]  [k] __netdev_pick_tx
     2.54%     -0.48%  [kernel.vmlinux]  [k] fib_validate_source
     2.83%     +0.46%  [kernel.vmlinux]  [k] ip_finish_output2
     1.45%     +0.45%  [kernel.vmlinux]  [k] netif_skb_features
     1.66%     -0.45%  [kernel.vmlinux]  [k] napi_gro_receive
     0.90%     -0.40%  [kernel.vmlinux]  [k] validate_xmit_skb_list
     1.45%     -0.39%  [kernel.vmlinux]  [k] ip_finish_output
               +0.36%  [8021q]           [k] vlan_passthru_hard_header
     1.28%     -0.33%  [kernel.vmlinux]  [k] netdev_pick_tx

> Perhaps we can improve the situation setting the IFF_XMIT_DST_RELEASE
> flag for vlan if the underlying device does not have (relevant)
> classifier attached? (and clearing it as needed)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
W dniu 2017-08-15 o 11:23, Jesper Dangaard Brouer pisze:

On Tue, 15 Aug 2017 02:38:56 +0200 Paweł Staszewski wrote:

W dniu 2017-08-14 o 18:19, Jesper Dangaard Brouer pisze:

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:

To show some difference below comparison vlan/no-vlan traffic:
10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe). I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface. This is larger than I expected, but still lower than the
30-40% slowdown you reported.

[...]

Ok mellanox arrived (MT27700 - mlx5 driver)
And to compare Mellanox with vlans and without: 33% performance
degradation (less than with ixgbe where I reach ~40% with same settings)

Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864

I wonder if the drivers page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).
The Mellanox drivers fortunately have a stats counter to tell us this
explicitly (which the ixgbe driver doesn't).

You can use my ethtool_stats.pl script to watch these stats:

 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
 (Hint perl dependency: dnf install perl-Time-HiRes)

For RX NIC:
Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
Ethtool(enp175s0f0) stat:   230978 (    230,978) <= rx0_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_csum_complete /sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_packets /sec
Ethtool(enp175s0f0) stat:   921614 (    921,614) <= rx0_page_reuse /sec
Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
Ethtool(enp175s0f0) stat:   233343 (    233,343) <= rx1_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_csum_complete /sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_packets /sec
Ethtool(enp175s0f0) stat:   927793 (    927,793) <= rx1_page_reuse /sec
Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
Ethtool(enp175s0f0) stat:   233735 (    233,735) <= rx2_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_csum_complete /sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_packets /sec
Ethtool(enp175s0f0) stat:   937989 (    937,989) <= rx2_page_reuse /sec
Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
Ethtool(enp175s0f0) stat:   230311 (    230,311) <= rx3_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_csum_complete /sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_packets /sec
Ethtool(enp175s0f0) stat:   922513 (    922,513) <= rx3_page_reuse /sec
Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
Ethtool(enp175s0f0) stat:   191969 (    191,969) <= rx4_cache_reuse /sec
Ethtool(enp175s0f0) stat:   958317 (    958,317) <= rx4_csum_complete /sec
Ethtool(enp175s0f0) stat:   958317 (    958,317) <= rx4_packets /sec
Ethtool(enp175s0f0) stat:   766332 (    766,332) <= rx4_page_reuse /sec
Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
Ethtool(enp175s0f0) stat:   197150 (    197,150) <= rx5_cache_reuse /sec
Ethtool(enp175s0f0) stat:   984128 (    984,128) <= rx5_csum_complete /sec
Ethtool(enp175s0f0) stat:   984128 (    984,128) <= rx5_packets /sec
Ethtool(enp175s0f0) stat:   786978 (    786,978) <= rx5_page_reuse /sec
Ethtool(enp175s0f0) stat: 79076984 ( 79,076,984) <= rx6_bytes /sec
Ethtool(enp175s0f0) stat:   233735 (    233,735) <= rx6_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1162897 (  1,162,897) <= rx6_csum_complete /
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Tue, 15 Aug 2017 02:38:56 +0200 Paweł Staszewski wrote:

> W dniu 2017-08-14 o 18:19, Jesper Dangaard Brouer pisze:
> > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:
> >
> >> To show some difference below comparison vlan/no-vlan traffic
> >>
> >> 10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan
> > I'm trying to reproduce in my testlab (with ixgbe). I do see a
> > performance reduction of about 10-19% when I forward out a VLAN
> > interface. This is larger than I expected, but still lower than the
> > 30-40% slowdown you reported.
> >
> > [...]
> Ok mellanox arrived (MT27700 - mlx5 driver)
> And to compare Mellanox with vlans and without: 33% performance
> degradation (less than with ixgbe where i reach ~40% with same settings)
>
> Mellanox without TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;11089305;709715520;8871553;567779392
> 1;16;64;11096292;710162688;11095566;710116224
> 2;16;64;11095770;710129280;11096799;710195136
> 3;16;64;11097199;710220736;11097702;710252928
> 4;16;64;11080984;567081856;11079662;709098368
> 5;16;64;11077696;708972544;11077039;708930496
> 6;16;64;11082991;709311424;8864802;567347328
> 7;16;64;11089596;709734144;8870927;709789184
> 8;16;64;11094043;710018752;11095391;710105024
>
> Mellanox with TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;7369914;471674496;7370281;471697980
> 1;16;64;7368896;471609408;7368043;471554752
> 2;16;64;7367577;471524864;7367759;471536576
> 3;16;64;7368744;377305344;7369391;471641024
> 4;16;64;7366824;471476736;7364330;471237120
> 5;16;64;7368352;471574528;7367239;471503296
> 6;16;64;7367459;471517376;7367806;471539584
> 7;16;64;7367190;471500160;7367988;471551232
> 8;16;64;7368023;471553472;7368076;471556864

I wonder if the drivers page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).
The Mellanox drivers fortunately have a stats counter to tell us this
explicitly (which the ixgbe driver doesn't).

You can use my ethtool_stats.pl script to watch these stats:

 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
 (Hint perl dependency: dnf install perl-Time-HiRes)

> ethtool settings for both tests:
> ifc='enp175s0f0 enp175s0f1'
> for i in $ifc
> do
>         ip link set up dev $i
>         ethtool -A $i autoneg off rx off tx off
>         ethtool -G $i rx 128 tx 256

The ring queue size recommendations might be different for the mlx5
driver (Cc'ing Mellanox maintainers).

>         ip link set $i txqueuelen 1000
>         ethtool -C $i rx-usecs 25
>         ethtool -L $i combined 16
>         ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off
> tx-nocache-copy off ntuple on
>         ethtool -N $i rx-flow-hash udp4 sdfn
> done

Thanks for being explicit about what your setup is :-)

> and perf top:
>     PerfTop: 83650 irqs/sec  kernel:99.7%  exact: 0.0% [4000Hz
> cycles], (all, 56 CPUs)
> ---
>
> 14.25%  [kernel]     [k] dst_release
> 14.17%  [kernel]     [k] skb_dst_force
> 13.41%  [kernel]     [k] rt_cache_valid
> 11.47%  [kernel]     [k] ip_finish_output2
>  7.01%  [kernel]     [k] do_raw_spin_lock
>  5.07%  [kernel]     [k] page_frag_free
>  3.47%  [mlx5_core]  [k] mlx5e_xmit
>  2.88%  [kernel]     [k] fib_table_lookup
>  2.43%  [mlx5_core]  [k] skb_from_cqe.isra.32
>  1.97%  [kernel]     [k] virt_to_head_page
>  1.81%  [mlx5_core]  [k] mlx5e_poll_tx_cq
>  0.93%  [kernel]     [k] __dev_queue_xmit
>  0.87%  [kernel]     [k] __build_skb
>  0.84%  [kernel]     [k] ipt_do_table
>  0.79%  [kernel]     [k] ip_rcv
>  0.79%  [kernel]     [k] acpi_processor_ffh_cstate_enter
>  0.78%  [kernel]     [k] netif_skb_features
>  0.73%  [kernel]     [k] __netif_receive_skb_core
>  0.52%  [kernel]     [k] dev_hard_start_xmit
>  0.52%  [kernel]     [k] build_skb
>  0.51%  [kernel]     [k] ip_route_input_rcu
>  0.50%  [kernel]     [k] skb_unref
>  0.49%  [kernel]     [k] ip_forward
>  0.48%  [mlx5_core]  [k] mlx5_cqwq_get_cqe
>  0.44%  [kernel]     [k] udp_v4_early_demux
>  0.41%  [kernel]     [k] napi_consume_skb
>  0.40%  [kernel]     [k] __local_bh_enable_ip
>  0.39%  [kernel]     [k] ip_rcv_finish
>  0.39%  [kernel]     [k] kmem_cache_alloc
>  0.38%  [kernel]     [k] sch_direct_xmit
>  0.33%  [kernel]     [k] validate_xmit_skb
>  0.32%  [mlx5_core]  [k] mlx5e_free_rx_wqe_reuse
>  0.29%  [kernel]     [k] netdev_pick_tx
>
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
With hack:

14.44%  [kernel]     [k] do_raw_spin_lock
 8.30%  [kernel]     [k] page_frag_free
 7.06%  [mlx5_core]  [k] mlx5e_xmit
 5.97%  [kernel]     [k] acpi_processor_ffh_cstate_enter
 5.73%  [kernel]     [k] fib_table_lookup
 4.81%  [mlx5_core]  [k] mlx5e_poll_tx_cq
 4.51%  [mlx5_core]  [k] skb_from_cqe.isra.32
 3.81%  [kernel]     [k] virt_to_head_page
 2.45%  [kernel]     [k] __dev_queue_xmit
 1.84%  [kernel]     [k] ipt_do_table
 1.77%  [kernel]     [k] napi_consume_skb
 1.62%  [kernel]     [k] __build_skb
 1.46%  [kernel]     [k] netif_skb_features
 1.43%  [kernel]     [k] __netif_receive_skb_core
 1.41%  [kernel]     [k] ip_rcv
 1.08%  [kernel]     [k] dev_hard_start_xmit
 1.02%  [kernel]     [k] build_skb
 1.00%  [mlx5_core]  [k] mlx5_cqwq_get_cqe
 0.96%  [kernel]     [k] ip_route_input_rcu
 0.95%  [kernel]     [k] ip_forward
 0.89%  [kernel]     [k] ip_finish_output2
 0.89%  [kernel]     [k] kmem_cache_alloc
 0.78%  [kernel]     [k] __local_bh_enable_ip
 0.76%  [kernel]     [k] udp_v4_early_demux
 0.75%  [kernel]     [k] compound_head
 0.75%  [kernel]     [k] __netdev_pick_tx
 0.73%  [kernel]     [k] sch_direct_xmit
 0.65%  [kernel]     [k] irq_entries_start
 0.63%  [mlx5_core]  [k] mlx5e_free_rx_wqe_reuse
 0.61%  [kernel]     [k] netdev_pick_tx
 0.61%  [kernel]     [k] validate_xmit_skb
 0.55%  [kernel]     [k] skb_network_protocol
 0.53%  [mlx5_core]  [k] mlx5e_rx_cache_get
 0.53%  [mlx5_core]  [k] mlx5e_build_rx_skb
 0.51%  [kernel]     [k] ip_rcv_finish
 0.50%  [kernel]     [k] eth_header
 0.50%  [kernel]     [k] fib_validate_source
 0.50%  [mlx5_core]  [k] mlx5e_handle_rx_cqe
 0.48%  [mlx5_core]  [k] eq_update_ci
 0.47%  [kernel]     [k] kmem_cache_free_bulk
 0.44%  [kernel]     [k] deliver_ptype_list_skb
 0.43%  [kernel]     [k] skb_release_data
 0.42%  [kernel]     [k] cpuidle_enter_state
 0.40%  [kernel]     [k] virt_to_head_page
 0.39%  [kernel]     [k] vlan_dev_hard_start_xmit
 0.39%  [kernel]     [k] neigh_connected_output
 0.38%  [kernel]     [k] eth_type_vlan
 0.35%  [mlx5_core]  [k] mlx5e_alloc_rx_wqe
 0.32%  [kernel]     [k] nf_hook_slow
 0.32%  [kernel]     [k] swiotlb_map_page
 0.31%  [kernel]     [k] ip_finish_output
 0.29%  [kernel]     [k] ip_output
 0.28%  [kernel]     [k] skb_free_head
 0.25%  [kernel]     [k] netif_receive_skb_internal
 0.25%  [kernel]     [k] __jhash_nwords

Without hack:

14.25%  [kernel]     [k] dst_release
14.17%  [kernel]     [k] skb_dst_force
13.41%  [kernel]     [k] rt_cache_valid
11.47%  [kernel]     [k] ip_finish_output2
 7.01%  [kernel]     [k] do_raw_spin_lock
 5.07%  [kernel]     [k] page_frag_free
 3.47%  [mlx5_core]  [k] mlx5e_xmit
 2.88%  [kernel]     [k] fib_table_lookup
 2.43%  [mlx5_core]  [k] skb_from_cqe.isra.32
 1.97%  [kernel]     [k] virt_to_head_page
 1.81%  [mlx5_core]  [k] mlx5e_poll_tx_cq
 0.93%  [kernel]     [k] __dev_queue_xmit
 0.87%  [kernel]     [k] __build_skb
 0.84%  [kernel]     [k] ipt_do_table
 0.79%  [kernel]     [k] ip_rcv
 0.79%  [kernel]     [k] acpi_processor_ffh_cstate_enter
 0.78%  [kernel]     [k] netif_skb_features
 0.73%  [kernel]     [k] __netif_receive_skb_core
 0.52%  [kernel]     [k] dev_hard_start_xmit
 0.52%  [kernel]     [k] build_skb
 0.51%  [kernel]     [k] ip_route_input_rcu
 0.50%  [kernel]     [k] skb_unref
 0.49%  [kernel]     [k] ip_forward
 0.48%  [mlx5_core]  [k] mlx5_cqwq_get_cqe
 0.44%  [kernel]     [k] udp_v4_early_demux
 0.41%  [kernel]     [k] napi_consume_skb
 0.40%  [kernel]     [k] __local_bh_enable_ip
 0.39%  [kernel]     [k] ip_rcv_finish
 0.39%  [kernel]     [k] kmem_cache_alloc
 0.38%  [kernel]     [k] sch_direct_xmit
 0.33%  [kernel]     [k] validate_xmit_skb
 0.32%  [mlx5_core]  [k] mlx5e_free_rx_wqe_reuse
 0.29%  [kernel]     [k] netdev_pick_tx
 0.28%  [mlx5_core]  [k] mlx5e_build_rx_skb
 0.27%  [kernel]     [k] deliver_ptype_list_skb
 0.26%  [kernel]     [k] fib_validate_source
 0.26%  [mlx5_core]  [k] mlx5e_napi_poll
 0.26%  [mlx5_core]  [k] mlx5e_handle_rx_cqe
 0.26%  [mlx5_core]  [k] mlx5e_rx_cache_get
 0.25%  [kernel]     [k] eth_header
 0.23%  [kernel]     [k] skb_network_protocol
 0.20%  [kernel]     [k] nf_hook_slow
 0.20%  [kernel]     [k] vlan_passthru_hard_header
 0.20%  [kernel]     [k] vlan_dev_hard_start_xmit
 0.19%  [kernel]     [k] swiotlb_map_page
 0.18%  [kernel]     [k] compound_head
 0.18%  [kernel]     [k] neigh_connected_output
 0.18%  [mlx5_core]
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Hi

Yes it helped - now there is almost no difference when using vlans or not:

10.5Mpps - with vlan
11Mpps   - without vlan

W dniu 2017-08-15 o 03:17, Eric Dumazet pisze:

On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:
> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.

Something like :

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->vlan_proto = proto;
 	vlan->vlan_id	 = nla_get_u16(data[IFLA_VLAN_ID]);
 	vlan->real_dev	 = real_dev;
+	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
 	vlan->flags	 = VLAN_FLAG_REORDER_HDR;
 
 	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:
> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.

Something like :

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->vlan_proto = proto;
 	vlan->vlan_id	 = nla_get_u16(data[IFLA_VLAN_ID]);
 	vlan->real_dev	 = real_dev;
+	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
 	vlan->flags	 = VLAN_FLAG_REORDER_HDR;
 
 	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Tue, 2017-08-15 at 02:45 +0200, Paweł Staszewski wrote:
>
> W dniu 2017-08-14 o 18:57, Paolo Abeni pisze:
> > On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> >> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> >> BUT it showed another interesting thing. The kernel code
> >> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> >> as it will first call skb_dst_force() and then skb_dst_drop() when the
> >> packet is transmitted on a VLAN.
> >>
> >> static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >> {
> >> [...]
> >>	/* If device/qdisc don't need skb->dst, release it right now while
> >>	 * its hot in this cpu cache.
> >>	 */
> >>	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> >>		skb_dst_drop(skb);
> >>	else
> >>		skb_dst_force(skb);
> > I think that the high impact of the above code in this specific test is
> > mostly due to the following:
> >
> > - ingress packets with different RSS rx hash lands on different CPUs
> yes but isn't this normal?
> everybody that wants to balance load over cores will try to use as many
> as possible :)
> With some limitations ... best are 6 to 7 RSS queues - so need to use 6
> to 7 cpu cores
>
> > - but they use the same dst entry, since the destination IPs belong to
> >   the same subnet
> typical for ddos - many sources, one destination

Nobody hit this issue yet. We usually change the kernel, given typical
workloads.

In this case, we might need a per cpu nh_rth_input.

Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
W dniu 2017-08-14 o 18:57, Paolo Abeni pisze:

On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:

The output (extracted below) didn't show who called 'do_raw_spin_lock',
BUT it showed another interesting thing. The kernel code
__dev_queue_xmit() might create a route dst-cache problem for itself(?),
as it will first call skb_dst_force() and then skb_dst_drop() when the
packet is transmitted on a VLAN.

static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
{
[...]
	/* If device/qdisc don't need skb->dst, release it right now while
	 * its hot in this cpu cache.
	 */
	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
		skb_dst_drop(skb);
	else
		skb_dst_force(skb);

I think that the high impact of the above code in this specific test is
mostly due to the following:

- ingress packets with different RSS rx hash land on different CPUs

yes but isn't this normal?
everybody that wants to balance load over cores will try to use as many
as possible :)
With some limitations ... best are 6 to 7 RSS queues - so need to use 6
to 7 cpu cores

- but they use the same dst entry, since the destination IPs belong to
  the same subnet

typical for ddos - many sources, one destination

- the dst refcnt cacheline is contended between all the CPUs

Perhaps we can improve the situation setting the IFF_XMIT_DST_RELEASE
flag for vlan if the underlying device does not have (relevant)
classifier attached? (and clearing it as needed)

Paolo
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
W dniu 2017-08-14 o 18:19, Jesper Dangaard Brouer pisze:

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:

To show some difference below comparison vlan/no-vlan traffic:
10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe). I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface. This is larger than I expected, but still lower than the
30-40% slowdown you reported.

[...]

Ok mellanox arrived (MT27700 - mlx5 driver)
And to compare Mellanox with vlans and without: 33% performance
degradation (less than with ixgbe where I reach ~40% with same settings)

Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864

ethtool settings for both tests:

ifc='enp175s0f0 enp175s0f1'
for i in $ifc
do
        ip link set up dev $i
        ethtool -A $i autoneg off rx off tx off
        ethtool -G $i rx 128 tx 256
        ip link set $i txqueuelen 1000
        ethtool -C $i rx-usecs 25
        ethtool -L $i combined 16
        ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy off ntuple on
        ethtool -N $i rx-flow-hash udp4 sdfn
done

and perf top:

    PerfTop: 83650 irqs/sec  kernel:99.7%  exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
---

14.25%  [kernel]     [k] dst_release
14.17%  [kernel]     [k] skb_dst_force
13.41%  [kernel]     [k] rt_cache_valid
11.47%  [kernel]     [k] ip_finish_output2
 7.01%  [kernel]     [k] do_raw_spin_lock
 5.07%  [kernel]     [k] page_frag_free
 3.47%  [mlx5_core]  [k] mlx5e_xmit
 2.88%  [kernel]     [k] fib_table_lookup
 2.43%  [mlx5_core]  [k] skb_from_cqe.isra.32
 1.97%  [kernel]     [k] virt_to_head_page
 1.81%  [mlx5_core]  [k] mlx5e_poll_tx_cq
 0.93%  [kernel]     [k] __dev_queue_xmit
 0.87%  [kernel]     [k] __build_skb
 0.84%  [kernel]     [k] ipt_do_table
 0.79%  [kernel]     [k] ip_rcv
 0.79%  [kernel]     [k] acpi_processor_ffh_cstate_enter
 0.78%  [kernel]     [k] netif_skb_features
 0.73%  [kernel]     [k] __netif_receive_skb_core
 0.52%  [kernel]     [k] dev_hard_start_xmit
 0.52%  [kernel]     [k] build_skb
 0.51%  [kernel]     [k] ip_route_input_rcu
 0.50%  [kernel]     [k] skb_unref
 0.49%  [kernel]     [k] ip_forward
 0.48%  [mlx5_core]  [k] mlx5_cqwq_get_cqe
 0.44%  [kernel]     [k] udp_v4_early_demux
 0.41%  [kernel]     [k] napi_consume_skb
 0.40%  [kernel]     [k] __local_bh_enable_ip
 0.39%  [kernel]     [k] ip_rcv_finish
 0.39%  [kernel]     [k] kmem_cache_alloc
 0.38%  [kernel]     [k] sch_direct_xmit
 0.33%  [kernel]     [k] validate_xmit_skb
 0.32%  [mlx5_core]  [k] mlx5e_free_rx_wqe_reuse
 0.29%  [kernel]     [k] netdev_pick_tx
 0.28%  [mlx5_core]  [k] mlx5e_build_rx_skb
 0.27%  [kernel]     [k] deliver_ptype_list_skb
 0.26%  [kernel]     [k] fib_validate_source
 0.26%  [mlx5_core]  [k] mlx5e_napi_poll
 0.26%  [mlx5_core]  [k] mlx5e_handle_rx_cqe
 0.26%  [mlx5_core]  [k] mlx5e_rx_cache_get
 0.25%  [kernel]     [k] eth_header
 0.23%  [kernel]     [k] skb_network_protocol
 0.20%  [kernel]     [k] nf_hook_slow
 0.20%  [kernel]     [k] vlan_passthru_hard_header
 0.20%  [kernel]     [k] vlan_dev_hard_start_xmit
 0.19%  [kernel]     [k] swiotlb_map_page
 0.18%  [kernel]     [k] compound_head
 0.18%  [kernel]     [k] neigh_connected_output
 0.18%  [mlx5_core]  [k] mlx5e_alloc_rx_wqe
 0.18%  [kernel]     [k] ip_output
 0.17%  [kernel]     [k] prefetch_freepointer.isra.70
 0.17%  [kernel]     [k] __slab_free
 0.16%  [kernel]     [k] eth_type_vlan
 0.16%  [kernel]
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> BUT it showed another interesting thing. The kernel code
> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> as it will first call skb_dst_force() and then skb_dst_drop() when the
> packet is transmitted on a VLAN.
>
> static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> {
> [...]
> 	/* If device/qdisc don't need skb->dst, release it right now while
> 	 * its hot in this cpu cache.
> 	 */
> 	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> 		skb_dst_drop(skb);
> 	else
> 		skb_dst_force(skb);

I think that the high impact of the above code in this specific test is
mostly due to the following:

- ingress packets with different RSS rx hash land on different CPUs
- but they use the same dst entry, since the destination IPs belong to
  the same subnet
- the dst refcnt cacheline is contended between all the CPUs

Perhaps we can improve the situation setting the IFF_XMIT_DST_RELEASE
flag for vlan if the underlying device does not have (relevant)
classifier attached? (and clearing it as needed)

Paolo
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> BUT it showed another interesting thing. The kernel code
> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> as it will first call skb_dst_force() and then skb_dst_drop() when the
> packet is transmitted on a VLAN.
>
> static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> {
> [...]
> 	/* If device/qdisc don't need skb->dst, release it right now while
> 	 * its hot in this cpu cache.
> 	 */
> 	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> 		skb_dst_drop(skb);
> 	else
> 		skb_dst_force(skb);

This is explained in this commit changelog:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=93f154b594fe47e4a7e5358b309add449a046cd3
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:
> To show some difference, below comparison vlan/no-vlan traffic
>
> 10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe). I do see a performance reduction of about 10-19% when I forward out a VLAN interface. This is larger than I expected, but still lower than the 30-40% slowdown you reported.

[...]

> >>> perf top:
> >>>
> >>>    PerfTop: 77835 irqs/sec  kernel:99.7%
> >>>
> >>>     16.32%  [kernel]  [k] skb_dst_force
> >>>     16.30%  [kernel]  [k] dst_release
> >>>     15.11%  [kernel]  [k] rt_cache_valid
> >>>     12.62%  [kernel]  [k] ipv4_mtu
> >>
> >> It seems a little strange that these 4 functions are at the top.

I don't see these in my test.

> >>
> >>>      5.60%  [kernel]  [k] do_raw_spin_lock
> >>
> >> Why is it calling/taking this lock? (Use perf call-graph recording.)
> >
> > can be hard to paste it here :)
> > attached file

The attachment was very big. Please don't attach such big files on mailing lists. Next time, please share them via e.g. pastebin. The output was a capture from your terminal, which made it more difficult to read. Hint: you can/could use perf --stdio and place it in a file instead.

The output (extracted below) didn't show who called 'do_raw_spin_lock', BUT it showed another interesting thing. The kernel code __dev_queue_xmit() might create a route dst-cache problem for itself(?), as it will first call skb_dst_force() and then skb_dst_drop() when the packet is transmitted on a VLAN.

static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
{
[...]
        /* If device/qdisc don't need skb->dst, release it right now while
         * its hot in this cpu cache.
         */
        if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
                skb_dst_drop(skb);
        else
                skb_dst_force(skb);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Extracted part of attached perf output:

 --5.37%--ip_rcv_finish
   |
   |--4.02%--ip_forward
   |    |
   |    |--3.92%--ip_forward_finish
   |         |
   |         |--3.91%--ip_output
   |              --3.90%--ip_finish_output
   |                   --3.88%--ip_finish_output2
   |                        --2.77%--neigh_connected_output
   |                             --2.74%--dev_queue_xmit
   |                                  --2.73%--__dev_queue_xmit
   |                                       |--1.66%--dev_hard_start_xmit
   |                                       |    |--1.64%--vlan_dev_hard_start_xmit
   |                                       |         |--1.63%--dev_queue_xmit
   |                                       |              |--1.62%--__dev_queue_xmit
   |                                       |                   |--0.99%--skb_dst_drop.isra.77
   |                                       |                   |    --0.99%--dst_release
   |                                       |                   |--0.55%--sch_direct_xmit
   |                                       --0.99%--skb_dst_force
   |
   --1.29%--ip_route_input_noref
        --1.29%--ip_route_input_rcu
             --1.05%--rt_cache_valid
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-08-14 at 02:07, Alexander Duyck wrote:

On Sat, Aug 12, 2017 at 10:27 AM, Paweł Staszewski wrote:

Hi and thanks for reply

On 2017-08-12 at 14:23, Jesper Dangaard Brouer wrote:

On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski wrote:

Hi

I made some tests for performance comparison.

Thanks for doing this. Feel free to Cc me, if you do more of these tests (so I don't miss them on the mailing list).

I don't understand if you are reporting a potential problem?

It would be good if you could provide a short summary section (of the issue) at the _start_ of the email, and then provide all this nice data afterwards, to back your case. My understanding is, you report:

1. VLANs on ixgbe show a 30-40% slowdown
2. System stopped scaling after 7+ CPUs

So I had read through most of this before I realized what it was you were reporting. As far as the behavior, there are a few things going on. I have some additional comments below, but they are mostly based on what I had read up to that point.

As far as possible issues for item 1: the VLAN adds 4 bytes of data to the payload; when it is stripped it can result in a packet that is 56 bytes. These missing 8 bytes can cause issues, as they force the CPU to do a read/modify/write every time the device writes to the 64B cache line, instead of just doing it as a single write. This can be very expensive and hurt performance. In addition it adds 4 bytes on the wire, so if you are sending the same 64B packets over the VLAN interface it is bumping them up to 68B to make room for the VLAN tag. I am suspecting you are encountering one of these types of issues. You might try tweaking the packet sizes in increments of 4 to see if there is a sweet spot that you might be falling out of or into.
No, this is not a problem with the 4-byte header or so, because the topology is like this:

TX generator (pktgen) physical interface, no vlan -> RX physical interface (no vlan) [ FORWARDING HOST ] TX vlan interface bound to physical interface -> SINK

Below data for packet size 70 (pktgen PKT_SIZE: 70):

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;70;7246720;434749440;7245917;420269856
1;16;70;7249152;434872320;7248885;420434344
2;16;70;7249024;434926080;7249225;420401400
3;16;70;7249984;434952960;7249448;420435736
4;16;70;7251200;435064320;7250990;420495244
5;16;70;7241408;434592000;7241781;420068074
6;16;70;7229696;433689600;7229750;419268196
7;16;70;7236032;434127360;7236133;419669092
8;16;70;7236608;434161920;7236274;419695830
9;16;70;7226496;433578240;7227501;419107826

100% cpu load on all 16 cores. The vlan/no-vlan difference currently on this host varies from 40 to even 50% (but I can't check if it can reach 50% performance degradation, because pktgen can give me only 10Mpps, with 70% cpu load on the forwarding host, so there is still room to forward, maybe at line rate 14Mpps).

Item 2 is a known issue with the NICs supported by ixgbe, at least for anything 82599 and later. The issue here is that there isn't really an Rx descriptor cache, so to try and optimize performance the hardware will try to write back as many descriptors as it has ready for the ring requesting writeback. The problem is that as you add more rings, the writes get smaller as they trigger more often. So what you end up seeing is that for each additional ring you add, the performance starts dropping as soon as the rings are no longer being fully saturated. You can tell this has happened when the CPUs in use suddenly all stop reporting 100% softirq use. So for example, to perform at line rate with 64B packets you would need something like XDP and to keep the ring count small, like maybe 2 rings. Any more than that and the performance will start to drop as you hit PCIe bottlenecks.
This is not only a problem/bug report, but some kind of comparison plus some thoughts about possible problems :) And it can help somebody when searching the net for what to expect :) Also, I don't know a better list where the smartest people that know what is going on in the kernel with networking are :)

Next time I will place the summary on top - sorry :)

> Tested HW (FORWARDING HOST):
>
> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

Interesting, I've not heard about an Intel CPU called "Gold" before now, but it does exist:
https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz

> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

This is one of my all time favorite NICs!

Yes, this is a good NIC - will have a connectx-4 2x100G by Monday, so will also do some tests.

> Test diagram:
>
> TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK
>
> Forwarder traffic: UDP random ports from 9 to 19 with random hosts from 172.16.0.1 to 172.16.0.255
>
> TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)

What kind of traffic flow? E.g. distribution, many/few source IPs...

Traffic generator is pktgen, so udp flows.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Sat, Aug 12, 2017 at 10:27 AM, Paweł Staszewski wrote:
> Hi and thanks for reply
>
> On 2017-08-12 at 14:23, Jesper Dangaard Brouer wrote:
>>
>> On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski wrote:
>>
>>> Hi
>>>
>>> I made some tests for performance comparison.
>>
>> Thanks for doing this. Feel free to Cc me, if you do more of these
>> tests (so I don't miss them on the mailing list).
>>
>> I don't understand if you are reporting a potential problem?
>>
>> It would be good if you could provide a short summary section (of the
>> issue) at the _start_ of the email, and then provide all this nice data
>> afterwards, to back your case.
>>
>> My understanding is, you report:
>>
>> 1. VLANs on ixgbe show a 30-40% slowdown
>> 2. System stopped scaling after 7+ CPUs

So I had read through most of this before I realized what it was you were reporting. As far as the behavior, there are a few things going on. I have some additional comments below, but they are mostly based on what I had read up to that point.

As far as possible issues for item 1: the VLAN adds 4 bytes of data to the payload; when it is stripped it can result in a packet that is 56 bytes. These missing 8 bytes can cause issues, as they force the CPU to do a read/modify/write every time the device writes to the 64B cache line, instead of just doing it as a single write. This can be very expensive and hurt performance. In addition it adds 4 bytes on the wire, so if you are sending the same 64B packets over the VLAN interface it is bumping them up to 68B to make room for the VLAN tag. I am suspecting you are encountering one of these types of issues. You might try tweaking the packet sizes in increments of 4 to see if there is a sweet spot that you might be falling out of or into.

Item 2 is a known issue with the NICs supported by ixgbe, at least for anything 82599 and later. The issue here is that there isn't really an Rx descriptor cache, so to try and optimize performance the hardware will try to write back as many descriptors as it has ready for the ring requesting writeback. The problem is that as you add more rings, the writes get smaller as they trigger more often. So what you end up seeing is that for each additional ring you add, the performance starts dropping as soon as the rings are no longer being fully saturated. You can tell this has happened when the CPUs in use suddenly all stop reporting 100% softirq use. So for example, to perform at line rate with 64B packets you would need something like XDP and to keep the ring count small, like maybe 2 rings. Any more than that and the performance will start to drop as you hit PCIe bottlenecks.

> This is not only a problem/bug report - but some kind of comparison plus
> some thoughts about possible problems :)
> And it can help somebody when searching the net for what to expect :)
> Also - I don't know a better list where the smartest people that know what is
> going on in the kernel with networking are :)
>
> Next time I will place the summary on top - sorry :)
>
>>> Tested HW (FORWARDING HOST):
>>>
>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>
>> Interesting, I've not heard about an Intel CPU called "Gold" before now,
>> but it does exist:
>> https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz
>>
>>> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
>>
>> This is one of my all time favorite NICs!
>
> Yes this is a good NIC - will have a connectx-4 2x100G by Monday, so will also
> do some tests
>
>>> Test diagram:
>>>
>>> TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST
>>> (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK
>>>
>>> Forwarder traffic: UDP random ports from 9 to 19 with random hosts from
>>> 172.16.0.1 to 172.16.0.255
>>>
>>> TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)
>>
>> What kind of traffic flow? E.g. distribution, many/few source IPs...
>
> Traffic generator is pktgen so udp flows - better paste parameters from
> pktgen:
> UDP_MIN=9
> UDP_MAX=19
>
> pg_set $dev "dst_min 172.16.0.1"
> pg_set $dev "dst_max 172.16.0.100"
>
> # Setup random UDP port src range
> #pg_set $dev "flag UDPSRC_RND"
> pg_set $dev "flag UDPSRC_RND"
> pg_set $dev "udp_src_min $UDP_MIN"
> pg_set $dev "udp_src_max $UDP_MAX"
>
>>> Settings used for FORWARDING HOST (changed param. was only number of RSS
>>> combined queues + set affinity assignment for them to fit with first
>>> numa node where 2x10G port card is installed)
>>>
>>> ixgbe driver used from kernel (in-kernel build - not a module)
>>
>> Nice with a script showing your setup, thanks. It would be good if it had
>> comments, telling why you think this is a needed setup adjustment.
>>
>>> #!/bin/sh
>>> ifc='enp216s0f0 enp216s0f1'
>>> for i in $ifc
>>> do
>>> ip link set up dev $i
>>> ethtool -A $i autoneg off rx off tx off
>>
>> Good:
>> Turning off
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
To show some difference, below is a comparison of vlan/no-vlan traffic:

10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan (ixgbe in-kernel driver, kernel 4.13.0-rc4-next-20170811)

ethtool settings for both tests:

ethtool -K $ifc gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy off ntuple off
ethtool -L $ifc combined 16
ethtool -C $ifc rx-usecs 2
ethtool -G $ifc rx 4096 tx 1024

16 CORES / 16 RSS QUEUES

TX traffic on vlan:
RX Interface: enp216s0f0
TX Interface: vlan1000 added to enp216s0f1 interface (with vlan 1000 ip address assigned)

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;6939008;416325120;6938696;402411192
1;16;64;6941952;416444160;6941745;402558918
2;16;64;6960576;417584640;6960707;403698718
3;16;64;6940736;416486400;6941820;402503876
4;16;64;6927680;415741440;6927420;401853870
5;16;64;6929792;415687680;6929917;401839196
6;16;64;6950400;416989440;6950661;403026166
7;16;64;6953664;417216000;6953454;403260544
8;16;64;6948480;416851200;6948800;403023266
9;16;64;6924160;415422720;6924092;401542468

100% load on all 16 cores.

vs

RX interface from traffic generator: enp216s0f0
TX interface to the sink: enp216s0f1
No vlan used

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;10280176;793608540;10298496;596796568
1;16;64;10046928;600978780;10046022;582527002
2;16;64;10032956;601827420;10026097;581515656
3;16;64;10051503;602252460;10067880;582420804
4;16;64;10016204;602725140;10017358;582644800
5;16;64;10035575;602437620;10059504;582067294
6;16;64;10041667;603069780;10057865;582477412
7;16;64;1008;600027420;10046526;581022018
8;16;64;10022436;601121100;10025946;581904314
9;16;64;10036231;602514960;10058724;582180684

So we have 10Mpps forwarded - I have problems with pktgen on my traffic generator to push more than 10M, but this is low budget hardware, so...
:) And there are still free cpu cycles, so probably it can forward at 10G line rate, 14Mpps:

Average:  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
Average:  all   0.00   0.00   0.00    0.00   0.00  20.91   0.00   0.00   0.00  79.09
Average:    0   0.00   0.00   0.00    0.00   0.00   0.09   0.00   0.00   0.00  99.91
Average:    1   0.03   0.00   0.03    0.00   0.00   0.00   0.00   0.00   0.00  99.94
Average:    2   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    3   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    4   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    5   0.00   0.00   0.18    0.00   0.00   0.00   0.00   0.00   0.00  99.82
Average:    6   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    7   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    8   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    9   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   10   0.00   0.00   0.03    0.24   0.00   0.00   0.00   0.00   0.00  99.74
Average:   11   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   12   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   13   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   14   0.00   0.00   0.00    0.00   0.00  92.38   0.00   0.00   0.00   7.62
Average:   15   0.00   0.00   0.00    0.00   0.00  85.88   0.00   0.00   0.00  14.12
Average:   16   0.00   0.00   0.00    0.00   0.00  64.91   0.00   0.00   0.00  35.09
Average:   17   0.00   0.00   0.00    0.00   0.00  66.76   0.00   0.00   0.00  33.24
Average:   18   0.00   0.00   0.00    0.00   0.00  65.57   0.00   0.00   0.00  34.43
Average:   19   0.00   0.00   0.00    0.00   0.00  66.38   0.00   0.00   0.00  33.62
Average:   20   0.00   0.00   0.00    0.00   0.00  72.97   0.00   0.00   0.00  27.03
Average:   21   0.00   0.00   0.00    0.00   0.00  70.80   0.00   0.00   0.00  29.20
Average:   22   0.00   0.00   0.00    0.00   0.00  66.44   0.00   0.00   0.00  33.56
Average:   23   0.00   0.00   0.00    0.00   0.00  66.12   0.00   0.00   0.00  33.88
Average:   24   0.00   0.00   0.00    0.00   0.00  68.35   0.00   0.00   0.00  31.65
Average:   25   0.00   0.00   0.00    0.00   0.00  71.79   0.00   0.00   0.00  28.21
Average:   26   0.00   0.00   0.00    0.00   0.00  70.24   0.00   0.00   0.00  29.76
Average:   27   0.00   0.00   0.00    0.00   0.00  73.24   0.00   0.00   0.00  26.76
Average:   28   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   29   0.00   0.00   0.00    0.00   0.00   0.
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski wrote:

> Hi
>
> I made some tests for performance comparison.

Thanks for doing this. Feel free to Cc me, if you do more of these tests (so I don't miss them on the mailing list).

I don't understand if you are reporting a potential problem?

It would be good if you could provide a short summary section (of the issue) at the _start_ of the email, and then provide all this nice data afterwards, to back your case.

My understanding is, you report:

1. VLANs on ixgbe show a 30-40% slowdown
2. System stopped scaling after 7+ CPUs

> Tested HW (FORWARDING HOST):
>
> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

Interesting, I've not heard about an Intel CPU called "Gold" before now, but it does exist:
https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz

> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

This is one of my all time favorite NICs!

> Test diagram:
>
> TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST
> (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK
>
> Forwarder traffic: UDP random ports from 9 to 19 with random hosts from
> 172.16.0.1 to 172.16.0.255
>
> TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)

What kind of traffic flow? E.g. distribution, many/few source IPs...

> Settings used for FORWARDING HOST (changed param. was only number of RSS
> combined queues + set affinity assignment for them to fit with first
> numa node where 2x10G port card is installed)
>
> ixgbe driver used from kernel (in-kernel build - not a module)

Nice with a script showing your setup, thanks. It would be good if it had comments, telling why you think each is a needed setup adjustment.

> #!/bin/sh
> ifc='enp216s0f0 enp216s0f1'
> for i in $ifc
> do
> ip link set up dev $i
> ethtool -A $i autoneg off rx off tx off

Good: Turning off Ethernet flow control, to avoid the receiver being the bottleneck via pause-frames.

> ethtool -G $i rx 4096 tx 1024

You adjust the RX and TX ring queue sizes; this has effects that you may not realize, especially for the ixgbe driver, which has a page recycle trick tied to the RX ring queue size.

> ip link set $i txqueuelen 1000

Setting the tx queue len to the default 1000 seems redundant.

> ethtool -C $i rx-usecs 10

Adjusting this also has effects you might not realize: it also affects the page recycle scheme of ixgbe. And it can sometimes be used to solve stalling on DMA TX completions, which could be your issue here.

> ethtool -L $i combined 16
> ethtool -K $i gro on tso on gso off sg on l2-fwd-offload off
> tx-nocache-copy on ntuple on

There are many settings above. GRO/GSO/TSO for _forwarding_ is actually bad... in my tests, enabling them results in approx 10% slowdown. AFAIK "tx-nocache-copy on" was also determined to be a bad option. "ntuple on" AFAIK disables the flow-director in the NIC. I thought this would actually help VLAN traffic, but I guess not.

> ethtool -N $i rx-flow-hash udp4 sdfn

Why do you change the NIC's flow-hash?

> done
>
> ip link set up dev enp216s0f0
> ip link set up dev enp216s0f1
>
> ip a a 10.0.0.1/30 dev enp216s0f0
>
> ip link add link enp216s0f1 name vlan1000 type vlan id 1000
> ip link set up dev vlan1000
> ip a a 10.0.0.5/30 dev vlan1000
>
> ip route add 172.16.0.0/12 via 10.0.0.6
>
> ./set_irq_affinity.sh -x 14-27,42-43 enp216s0f0
> ./set_irq_affinity.sh -x 14-27,42-43 enp216s0f1
> #cat /sys/devices/system/node/node1/cpulist
> #14-27,42-55
> #cat /sys/devices/system/node/node0/cpulist
> #0-13,28-41

Is this a NUMA system?

> Looks like forwarding performance when using vlans on ixgbe is less than
> without vlans by about 30-40% (wondering if this is some vlan
> offloading problem and ixgbe)

I would see it as a problem/bug that enabling VLANs costs this much.
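[Editor's illustration, not part of the thread] Pulling together the review comments above with the ethtool settings Paweł used in his later vlan/no-vlan comparison, a forwarding-oriented variant of the setup script might look like the sketch below. This is only a hypothetical sketch; whether each knob helps depends on the workload, and rx-usecs interacts with the ixgbe page recycle scheme as noted above.

```shell
#!/bin/sh
# Hypothetical forwarding-tuned variant of the original setup script.
ifc='enp216s0f0 enp216s0f1'
for i in $ifc
do
        ip link set up dev $i
        # Disable flow control so pause frames cannot throttle the receiver
        ethtool -A $i autoneg off rx off tx off
        # GRO/GSO/TSO off: aggregation cost ~10% in pure-forwarding tests;
        # tx-nocache-copy off and ntuple off per the review comments above
        ethtool -K $i gro off tso off gso off sg on \
                tx-nocache-copy off ntuple off
        ethtool -L $i combined 16
done
```

Leaving txqueuelen and the ring sizes at their defaults (rather than resetting them) keeps the ixgbe page recycle behaviour unchanged, which the comments above flag as easy to disturb unknowingly.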
> settings below:
>
> ethtool -k enp216s0f0
> Features for enp216s0f0:
> Cannot get device udp-fragmentation-offload settings: Operation not supported
> rx-checksumming: on
> tx-checksumming: on
>         tx-checksum-ipv4: off [fixed]
>         tx-checksum-ip-generic: on
>         tx-checksum-ipv6: off [fixed]
>         tx-checksum-fcoe-crc: off [fixed]
>         tx-checksum-sctp: on
> scatter-gather: on
>         tx-scatter-gather: on
>         tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
>         tx-tcp-segmentation: on
>         tx-tcp-ecn-segmentation: off [fixed]
>         tx-tcp-mangleid-segmentation: on
>         tx-tcp6-segmentation: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: off
> generic-receive-offload: on
> large-receive-offload: off
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: on
> receive-hashing: on
> highdma: on [fixed]
> rx-vlan-filter: on
> v
Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Hi

I made some tests for performance comparison.

Tested HW (FORWARDING HOST):

Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Test diagram:

TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK

Forwarder traffic: UDP random ports from 9 to 19 with random hosts from 172.16.0.1 to 172.16.0.255

TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)

Settings used for FORWARDING HOST (changed param. was only the number of RSS combined queues + set affinity assignment for them to fit with the first numa node, where the 2x10G port card is installed)

ixgbe driver used from kernel (in-kernel build - not a module)

#!/bin/sh
ifc='enp216s0f0 enp216s0f1'
for i in $ifc
do
        ip link set up dev $i
        ethtool -A $i autoneg off rx off tx off
        ethtool -G $i rx 4096 tx 1024
        ip link set $i txqueuelen 1000
        ethtool -C $i rx-usecs 10
        ethtool -L $i combined 16
        ethtool -K $i gro on tso on gso off sg on l2-fwd-offload off tx-nocache-copy on ntuple on
        ethtool -N $i rx-flow-hash udp4 sdfn
done

ip link set up dev enp216s0f0
ip link set up dev enp216s0f1

ip a a 10.0.0.1/30 dev enp216s0f0

ip link add link enp216s0f1 name vlan1000 type vlan id 1000
ip link set up dev vlan1000
ip a a 10.0.0.5/30 dev vlan1000

ip route add 172.16.0.0/12 via 10.0.0.6

./set_irq_affinity.sh -x 14-27,42-43 enp216s0f0
./set_irq_affinity.sh -x 14-27,42-43 enp216s0f1
#cat /sys/devices/system/node/node1/cpulist
#14-27,42-55
#cat /sys/devices/system/node/node0/cpulist
#0-13,28-41
#

Looks like forwarding performance when using vlans on ixgbe is less than without vlans by about 30-40% (wondering if this is some vlan offloading problem and ixgbe)

settings below:

ethtool -k enp216s0f0
Features for enp216s0f0:
Cannot get device udp-fragmentation-offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on

Another thing is that forwarding performance does not scale with the number of cores when 7+ cores are reached.

perf top:

   PerfTop: 77835 irqs/sec  kernel:99.7%  exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
---
    16.32%  [kernel]  [k] skb_dst_force
    16.30%  [kernel]  [k] dst_release
    15.11%  [kernel]  [k] rt_cache_valid
    12.62%  [kernel]  [k] ipv4_mtu
     5.60%  [kernel]  [k] do_raw_spin_lock
     3.03%  [kernel]  [k] fib_table_lookup
     2.70%  [kernel]  [k] ip_finish_output2
     2.10%  [kernel]  [k] dev_gro_receive
     1.89%  [kernel]  [k] eth_type_trans
     1.81%  [kernel]  [k] ixgbe_poll
     1.15%  [kernel]  [k] ixgbe_xmit_frame_ring
     1.06%  [kernel]  [k] __build_skb
     1.04%  [kernel]  [k] __dev_queue_xmit
     0.97%  [kernel]  [k] ip_rcv
     0.78%  [kernel]  [k] netif_skb_features
     0.74%  [kernel]  [k] ipt_do_table
     0.70%  [kernel]  [k] acpi_processor_ffh_cstate_enter
     0.64%  [kernel]  [k] ip_forward
     0.59%  [kernel]  [k] __netif_receive_skb_core
     0.55%  [kernel]  [k] dev_hard_start_xmit
     0.53%  [kernel]  [k] ip_route_input_rcu
     0.53%  [kernel]  [k] ip_rcv_finish
     0.51%  [kernel]  [k] pa