Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-10-18 Thread Paweł Staszewski



On 2017-10-18 at 23:54, Eric Dumazet wrote:

On Wed, 2017-10-18 at 23:49 +0200, Paweł Staszewski wrote:


How far is this from being applied to the kernel?

So far I have been using this on all my servers for about 3 months now
without problems.

It is a hack, and does not properly support bonding/team.

( If the real_dev->priv_flags IFF_XMIT_DST_RELEASE bit changes,
  we want to update all the vlans at the same time )

We need something more sophisticated, and I had no time to spend on
this topic recently.






ok



Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-10-18 Thread Eric Dumazet
On Wed, 2017-10-18 at 23:49 +0200, Paweł Staszewski wrote:

> How far is this from being applied to the kernel?
> 
> So far I have been using this on all my servers for about 3 months now 
> without problems.

It is a hack, and does not properly support bonding/team.

( If the real_dev->priv_flags IFF_XMIT_DST_RELEASE bit changes,
 we want to update all the vlans at the same time )

We need something more sophisticated, and I had no time to spend on
this topic recently.
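
To make "something more sophisticated" concrete, below is a minimal user-space sketch (made-up struct and flag value, not kernel code) of re-deriving the bit on every vlan whenever the lower device's IFF_XMIT_DST_RELEASE bit changes, instead of copying it once at vlan creation time as the one-line hack does. In the real kernel this would have to hook wherever real_dev's priv_flags can change, which Paolo's suggestion later in the thread ties to whether a (relevant) classifier is attached.

#include <stdio.h>

#define XMIT_DST_RELEASE 0x1u            /* stand-in flag bit, not the kernel's value */

struct netdev_model {                    /* stand-in for struct net_device */
	const char *name;
	unsigned int priv_flags;
};

/* Re-derive the bit on every vlan stacked on real_dev. */
static void sync_xmit_dst_release(const struct netdev_model *real_dev,
				  struct netdev_model *vlans, int n)
{
	for (int i = 0; i < n; i++) {
		vlans[i].priv_flags &= ~XMIT_DST_RELEASE;
		vlans[i].priv_flags |= real_dev->priv_flags & XMIT_DST_RELEASE;
	}
}

int main(void)
{
	struct netdev_model eth0 = { "enp175s0f0", XMIT_DST_RELEASE };
	struct netdev_model vlans[] = { { "vlan1000", 0 }, { "vlan1001", 0 } };

	sync_xmit_dst_release(&eth0, vlans, 2);   /* vlans inherit the bit */

	eth0.priv_flags &= ~XMIT_DST_RELEASE;     /* lower device can no longer release dst early */
	sync_xmit_dst_release(&eth0, vlans, 2);   /* the vlans drop the bit at the same time */

	for (int i = 0; i < 2; i++)
		printf("%s: xmit_dst_release=%u\n", vlans[i].name,
		       !!(vlans[i].priv_flags & XMIT_DST_RELEASE));
	return 0;
}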






Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-10-18 Thread Paweł Staszewski

On 2017-09-21 at 23:41, Florian Fainelli wrote:

On 09/21/2017 02:26 PM, Paweł Staszewski wrote:


On 2017-08-15 at 11:11, Paweł Staszewski wrote:

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index
5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net,
struct net_device *dev,
   vlan->vlan_proto = proto;
   vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
   vlan->real_dev = real_dev;
+dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
   vlan->flags = VLAN_FLAG_REORDER_HDR;
 err = vlan_check_real_dev(real_dev, vlan->vlan_proto,
vlan->vlan_id);

Any plans for this patch to be merged into the kernel properly?

Would not this apply to pretty much any stacked device setup though? It
seems like any network device that just queues up its packet on another
physical device for actual transmission may need that (e.g.: DSA, bond,
team, more?)


How far is this from being applied to the kernel?

So far I have been using this on all my servers for about 3 months now 
without problems.




Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 15:07 -0700, Florian Fainelli wrote:
> On 09/21/2017 02:54 PM, Eric Dumazet wrote:
> > On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:
> > 
> >> Would not this apply to pretty much any stacked device setup though? It
> >> seems like any network device that just queues up its packet on another
> >> physical device for actual transmission may need that (e.g.: DSA, bond,
> >> team, more?)
> > 
> > We support bonding and team already.
> 
> Right, so that seems to mostly leave us with DSA at least. What about
> other devices that also have IFF_NO_QUEUE set?

It won't work.

loopback has IFF_NO_QUEUE, but you need to keep dst on skbs...
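
A tiny user-space sketch of that distinction (stand-in flag values and a fake skb struct, not kernel code): the xmit-path decision quoted later in this thread keys off IFF_XMIT_DST_RELEASE, not IFF_NO_QUEUE, so a queue-less device like loopback simply leaves the release bit cleared and the dst is kept.

#include <stdbool.h>
#include <stdio.h>

#define NO_QUEUE	 0x1u   /* stand-in for IFF_NO_QUEUE */
#define XMIT_DST_RELEASE 0x2u   /* stand-in for IFF_XMIT_DST_RELEASE */

struct skb_model { bool has_dst; };

static void xmit_dst_decision(unsigned int priv_flags, struct skb_model *skb)
{
	if (priv_flags & XMIT_DST_RELEASE)
		skb->has_dst = false;   /* skb_dst_drop(): dst not needed past this device */
	/* else: skb_dst_force(), the dst must survive for later delivery */
}

int main(void)
{
	struct skb_model via_eth = { true }, via_lo = { true };

	xmit_dst_decision(XMIT_DST_RELEASE, &via_eth);  /* ordinary forwarding device */
	xmit_dst_decision(NO_QUEUE, &via_lo);           /* loopback-like: queue-less, keeps dst */

	printf("eth-like keeps dst: %d, loopback-like keeps dst: %d\n",
	       via_eth.has_dst, via_lo.has_dst);
	return 0;
}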





Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Florian Fainelli
On 09/21/2017 02:54 PM, Eric Dumazet wrote:
> On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:
> 
>> Would not this apply to pretty much any stacked device setup though? It
>> seems like any network device that just queues up its packet on another
>> physical device for actual transmission may need that (e.g.: DSA, bond,
>> team, more?)
> 
> We support bonding and team already.

Right, so that seems to mostly leave us with DSA at least. What about
other devices that also have IFF_NO_QUEUE set?
-- 
Florian


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 14:41 -0700, Florian Fainelli wrote:

> Would not this apply to pretty much any stacked device setup though? It
> seems like any network device that just queues up its packet on another
> physical device for actual transmission may need that (e.g.: DSA, bond,
> team, more?)

We support bonding and team already.
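
For illustration only, a user-space model of one plausible rule such masters can use (a sketch, not the actual bonding/team code): the master advertises IFF_XMIT_DST_RELEASE only while every slave it may transmit through still has it, recomputing the bit whenever the slave set changes.

#include <stdio.h>

#define XMIT_DST_RELEASE 0x1u            /* stand-in flag bit */

struct netdev_model {
	const char *name;
	unsigned int priv_flags;
};

/* Master may only keep the bit if every slave it transmits through has it. */
static void master_recompute_flag(struct netdev_model *master,
				  const struct netdev_model *slaves, int n)
{
	unsigned int bit = XMIT_DST_RELEASE;

	for (int i = 0; i < n; i++)
		bit &= slaves[i].priv_flags;

	master->priv_flags &= ~XMIT_DST_RELEASE;
	master->priv_flags |= bit;
}

int main(void)
{
	struct netdev_model bond0 = { "bond0", 0 };
	struct netdev_model slaves[] = {
		{ "enp175s0f0", XMIT_DST_RELEASE },
		{ "enp175s0f1", 0 },             /* this slave still needs the dst */
	};

	master_recompute_flag(&bond0, slaves, 2);
	printf("bond0 xmit_dst_release=%u\n",
	       !!(bond0.priv_flags & XMIT_DST_RELEASE));
	return 0;
}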




Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 23:41, Florian Fainelli wrote:

On 09/21/2017 02:26 PM, Paweł Staszewski wrote:


On 2017-08-15 at 11:11, Paweł Staszewski wrote:

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index
5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net,
struct net_device *dev,
   vlan->vlan_proto = proto;
   vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
   vlan->real_dev = real_dev;
+dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
   vlan->flags = VLAN_FLAG_REORDER_HDR;
 err = vlan_check_real_dev(real_dev, vlan->vlan_proto,
vlan->vlan_id);

Any plans for this patch to be merged into the kernel properly?

Would not this apply to pretty much any stacked device setup though? It
seems like any network device that just queues up its packet on another
physical device for actual transmission may need that (e.g.: DSA, bond,
team, more?)

Some devices like bond have it.

Maybe when the first patch was added, vlans were not taken into account.
I did not check them all :)

But I know Eric will :)








Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Florian Fainelli
On 09/21/2017 02:26 PM, Paweł Staszewski wrote:
> 
> 
> On 2017-08-15 at 11:11, Paweł Staszewski wrote:
>> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
>> index
>> 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
>> 100644
>> --- a/net/8021q/vlan_netlink.c
>> +++ b/net/8021q/vlan_netlink.c
>> @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net,
>> struct net_device *dev,
>>   vlan->vlan_proto = proto;
>>   vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
>>   vlan->real_dev = real_dev;
>> +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
>>   vlan->flags = VLAN_FLAG_REORDER_HDR;
>> err = vlan_check_real_dev(real_dev, vlan->vlan_proto,
>> vlan->vlan_id); 
> 
> Any plans for this patch to be merged into the kernel properly?

Would not this apply to pretty much any stacked device setup though? It
seems like any network device that just queues up its packet on another
physical device for actual transmission may need that (e.g.: DSA, bond,
team, more?)
-- 
Florian


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 23:34, Eric Dumazet wrote:

On Thu, 2017-09-21 at 23:26 +0200, Paweł Staszewski wrote:

On 2017-08-15 at 11:11, Paweł Staszewski wrote:

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index
5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net,
struct net_device *dev,
   vlan->vlan_proto = proto;
   vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
   vlan->real_dev = real_dev;
+dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
   vlan->flags = VLAN_FLAG_REORDER_HDR;
 err = vlan_check_real_dev(real_dev, vlan->vlan_proto,
vlan->vlan_id);

Any plans for this patch to be merged into the kernel properly?

So far I have been using it for about 3 weeks on all my Linux-based routers -
and no problems.

Yes, I was about to submit it, as I mentioned to you a few hours ago ;)






Yes, I saw your point 2) in previous emails :)
But there was no patch for this in the previous reply, so I was thinking that 
maybe there were too many things to do and you forgot about it :)


Thanks
Paweł



Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 23:26 +0200, Paweł Staszewski wrote:
> 
> On 2017-08-15 at 11:11, Paweł Staszewski wrote:
> > diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> > index 
> > 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
> >  
> > 100644
> > --- a/net/8021q/vlan_netlink.c
> > +++ b/net/8021q/vlan_netlink.c
> > @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, 
> > struct net_device *dev,
> >   vlan->vlan_proto = proto;
> >   vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
> >   vlan->real_dev = real_dev;
> > +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
> >   vlan->flags = VLAN_FLAG_REORDER_HDR;
> > err = vlan_check_real_dev(real_dev, vlan->vlan_proto, 
> > vlan->vlan_id); 
> 
> Any plans for this patch to be merged into the kernel properly?
> 
> So far I have been using it for about 3 weeks on all my Linux-based routers - 
> and no problems.

Yes, I was about to submit it, as I mentioned to you a few hours ago ;)





Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-21 Thread Paweł Staszewski



On 2017-08-15 at 11:11, Paweł Staszewski wrote:

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 
5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 
100644

--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, 
struct net_device *dev,

  vlan->vlan_proto = proto;
  vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
  vlan->real_dev = real_dev;
+    dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
  vlan->flags = VLAN_FLAG_REORDER_HDR;
    err = vlan_check_real_dev(real_dev, vlan->vlan_proto, 
vlan->vlan_id); 


Any plans for this patch to be merged into the kernel properly?

So far I have been using it for about 3 weeks on all my Linux-based routers - 
and no problems.




Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-11 Thread Paweł Staszewski

Another test for this patch with linux-next tree

with patch:

 bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
  input: /proc/net/dev type: rate
  - iface                  Rx                 Tx              Total
  ==================================================================
   vlan1004:        1.00 P/s     606842.31 P/s     606843.31 P/s
         lo:        0.00 P/s          0.00 P/s          0.00 P/s
   vlan1016:        0.00 P/s     607730.56 P/s     607730.56 P/s
   vlan1020:        0.00 P/s     606891.25 P/s     606891.25 P/s
   vlan1018:        0.00 P/s     607580.88 P/s     607580.88 P/s
   vlan1014:        0.00 P/s     607606.81 P/s     607606.81 P/s
   vlan1005:        0.00 P/s     606788.44 P/s     606788.44 P/s
   enp2s0f0:        2.00 P/s          2.00 P/s          3.99 P/s
   vlan1017:        0.00 P/s     607643.75 P/s     607643.75 P/s
   enp132s0: 13079658.00 P/s          0.00 P/s   13079658.00 P/s
   vlan1000:        0.00 P/s     604409.19 P/s     604409.19 P/s
   vlan1010:        0.00 P/s     606984.06 P/s     606984.06 P/s
   vlan1019:        0.00 P/s     607452.12 P/s     607452.12 P/s
   vlan1008:        0.00 P/s     606803.44 P/s     606803.44 P/s
   vlan1011:        0.00 P/s     607048.94 P/s     607048.94 P/s
   vlan1001:        0.00 P/s     606773.50 P/s     606773.50 P/s
   vlan1006:        0.00 P/s     606811.38 P/s     606811.38 P/s
   vlan1012:        0.00 P/s     607051.94 P/s     607051.94 P/s
   vlan1013:        0.00 P/s     607067.88 P/s     607067.88 P/s
     enp4s0:        2.00 P/s   13020803.00 P/s   13020805.00 P/s
   vlan1007:        0.00 P/s     606798.44 P/s     606798.44 P/s
   vlan1002:        0.00 P/s     606840.31 P/s     606840.31 P/s
   vlan1009:        0.00 P/s     606809.38 P/s     606809.38 P/s
   enp2s0f1:      100.80 P/s          0.00 P/s        100.80 P/s
   vlan1015:        0.00 P/s     607089.81 P/s     607089.81 P/s
   vlan1003:        1.00 P/s     606928.19 P/s     606929.19 P/s
  ------------------------------------------------------------------
      total: 13079765.00 P/s   25766758.00 P/s   38846524.00 P/s


13 Mpps forwarded (32 cores active for two mlx5 NICs)

80% CPU load (20% idle on all cores)


   PerfTop:  126552 irqs/sec  kernel:99.3%  exact:  0.0% [4000Hz 
cycles],  (all, 32 CPUs)

---

 8.25%  [kernel]   [k] fib_table_lookup
 7.98%  [kernel]   [k] do_raw_spin_lock
 6.20%  [kernel]   [k] mlx5e_handle_rx_cqe_mpwrq
 4.21%  [kernel]   [k] mlx5e_xmit
 3.37%  [kernel]   [k] __dev_queue_xmit
 2.95%  [kernel]   [k] ip_rcv
 2.72%  [kernel]   [k] ipt_do_table
 2.24%  [kernel]   [k] ip_finish_output2
 2.22%  [kernel]   [k] __netif_receive_skb_core
 2.17%  [kernel]   [k] ip_forward
 2.15%  [kernel]   [k] __build_skb
 1.99%  [kernel]   [k] ip_route_input_rcu
 1.70%  [kernel]   [k] mlx5e_txwqe_complete
 1.54%  [kernel]   [k] dev_gro_receive
 1.45%  [kernel]   [k] mlx5_cqwq_get_cqe
 1.38%  [kernel]   [k] udp_v4_early_demux
 1.35%  [kernel]   [k] netif_skb_features
 1.33%  [kernel]   [k] inet_gro_receive
 1.29%  [kernel]   [k] dev_hard_start_xmit
 1.27%  [kernel]   [k] ip_rcv_finish
 1.19%  [kernel]   [k] mlx5e_build_rx_skb
 1.15%  [kernel]   [k] __netdev_pick_tx
 1.11%  [kernel]   [k] kmem_cache_alloc
 1.09%  [kernel]   [k] mlx5e_poll_tx_cq
 1.07%  [kernel]   [k] mlx5e_txwqe_build_dsegs
 1.00%  [kernel]   [k] vlan_dev_hard_start_xmit
 0.90%  [kernel]   [k] __napi_alloc_skb
 0.87%  [kernel]   [k] validate_xmit_skb
 0.87%  [kernel]   [k] read_tsc
 0.83%  [kernel]   [k] napi_gro_receive
 0.79%  [kernel]   [k] skb_network_protocol
 0.79%  [kernel]   [k] sch_direct_xmit
 0.78%  [kernel]   [k] __local_bh_enable_ip
 0.78%  [kernel]   [k] netdev_pick_tx
 0.75%  [kernel]   [k] __udp4_lib_lookup
 0.72%  [kernel]   [k] netif_receive_skb_internal
 0.71%  [kernel]   [k] page_frag_free
 0.71%  [kernel]   [k] deliver_ptype_list_skb
 0.70%  [kernel]   [k] fib_validate_source
 0.69%  [kernel]   [k] mlx5_cqwq_get_cqe
 0.69%  [kernel]   [k] __netif_receive_skb
 0.68%  [kernel]   [k] vlan_passthru_hard_header
 0.61%  [kernel]   [k] rt_cache_valid
 0.59%  [kernel]   [k] iptable_filter_hook



Without patch:

12.7 Mpps forwarded

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-11 Thread Paweł Staszewski

Tested with ConnectX-5

Without patch

10 Mpps -> 16 cores used

   PerfTop:   66258 irqs/sec  kernel:99.3%  exact:  0.0% [4000Hz 
cycles],  (all, 32 CPUs)
--- 



    10.12%  [kernel]   [k] do_raw_spin_lock
 6.31%  [kernel]   [k] fib_table_lookup
 6.12%  [kernel]   [k] mlx5e_handle_rx_cqe_mpwrq
 4.90%  [kernel]   [k] rt_cache_valid
 3.99%  [kernel]   [k] mlx5e_xmit
 3.03%  [kernel]   [k] ip_rcv
 2.68%  [kernel]   [k] __netif_receive_skb_core
 2.54%  [kernel]   [k] skb_dst_force
 2.41%  [kernel]   [k] ip_finish_output2
 2.21%  [kernel]   [k] __build_skb
 2.03%  [kernel]   [k] __dev_queue_xmit
 1.96%  [kernel]   [k] mlx5e_txwqe_complete
 1.79%  [kernel]   [k] ipt_do_table
 1.78%  [kernel]   [k] inet_gro_receive
 1.69%  [kernel]   [k] ip_forward
 1.66%  [kernel]   [k] udp_v4_early_demux
 1.65%  [kernel]   [k] dst_release
 1.56%  [kernel]   [k] ip_rcv_finish
 1.45%  [kernel]   [k] dev_gro_receive
 1.45%  [kernel]   [k] netif_skb_features
 1.39%  [kernel]   [k] mlx5e_poll_tx_cq
 1.35%  [kernel]   [k] mlx5e_txwqe_build_dsegs
 1.35%  [kernel]   [k] ip_route_input_rcu
 1.15%  [kernel]   [k] dev_hard_start_xmit
 1.12%  [kernel]   [k] napi_gro_receive
 1.07%  [kernel]   [k] netif_receive_skb_internal
 0.98%  [kernel]   [k] sch_direct_xmit
 0.95%  [kernel]   [k] kmem_cache_alloc
 0.89%  [kernel]   [k] read_tsc
 0.88%  [kernel]   [k] mlx5e_build_rx_skb
 0.86%  [kernel]   [k] mlx5_cqwq_get_cqe
 0.82%  [kernel]   [k] page_frag_free
 0.78%  [kernel]   [k] __local_bh_enable_ip
 0.69%  [kernel]   [k] skb_network_protocol
 0.68%  [kernel]   [k] __netif_receive_skb
 0.67%  [kernel]   [k] vlan_dev_hard_start_xmit
 0.65%  [kernel]   [k] mlx5e_poll_rx_cq
 0.65%  [kernel]   [k] validate_xmit_skb
 0.60%  [kernel]   [k] eth_type_trans
 0.60%  [kernel]   [k] deliver_ptype_list_skb
 0.60%  [kernel]   [k] fib_validate_source
 0.55%  [kernel]   [k] eth_header
 0.53%  [kernel]   [k] netdev_pick_tx
 0.53%  [kernel]   [k] __napi_alloc_skb
 0.51%  [kernel]   [k] __udp4_lib_lookup
 0.50%  [kernel]   [k] eth_type_vlan
 0.49%  [kernel]   [k] ip_output
 0.49%  [kernel]   [k] page_frag_alloc
 0.49%  [kernel]   [k] ip_finish_output
 0.48%  [kernel]   [k] neigh_connected_output
 0.45%  [kernel]   [k] nf_hook_slow
 0.44%  [kernel]   [k] udp4_gro_receive
 0.39%  [kernel]   [k] mlx5e_features_check
 0.39%  [kernel]   [k] mlx5e_napi_poll
 0.37%  [kernel]   [k] __jhash_nwords
 0.37%  [kernel]   [k] udp_gro_receive
 0.36%  [kernel]   [k] swiotlb_map_page
 0.33%  [kernel]   [k] mlx5_cqwq_get_wqe
 0.33%  [kernel]   [k] __netdev_pick_tx
 0.29%  [kernel]   [k] ktime_get_with_offset
 0.29%  [kernel]   [k] get_dma_ops
 0.29%  [kernel]   [k] validate_xmit_skb_list
 0.26%  [kernel]   [k] vlan_passthru_hard_header
 0.26%  [kernel]   [k] __udp4_lib_lookup_skb
 0.24%  [kernel]   [k] get_dma_ops
 0.24%  [kernel]   [k] skb_release_data
 0.23%  [kernel]   [k] ip_forward_finish
 0.23%  [kernel]   [k] kmem_cache_free_bulk
 0.23%  [kernel]   [k] timekeeping_get_ns
 0.22%  [kernel]   [k] ip_skb_dst_mtu
 0.21%  [kernel]   [k] compound_head
 0.20%  [kernel]   [k] skb_gro_reset_offset
 0.20%  [kernel]   [k] is_swiotlb_buffer
 0.19%  [kernel]   [k] __net_timestamp.isra.90
 0.19%  [kernel]   [k] dst_metric.constprop.61
 0.18%  [kernel]   [k] skb_orphan_frags.constprop.126
 0.18%  [kernel]   [k] _kfree_skb_defer
 0.18%  [kernel]   [k] irq_entries_start
 0.17%  [kernel]   [k] dev_hard_header.constprop.54
 0.17%  [kernel]   [k] dma_mapping_error
 0.17%  [kernel]   [k] neigh_resolve_output




With patch

12 Mpps -> 16 cores

   PerfTop:   66209 irqs/sec  kernel:99.3%  exact:  0.0% [4000Hz 
cycles],  (all, 32 CPUs)

---

    10.67%  [kernel]   [k] do_raw_spin_lock
 6.96%  [kernel]   [k] fib_table_lookup
 6.53%  [kernel]   [k] mlx5e_handle_rx_cqe_mpwrq
 4.17%  [kernel]   [k] mlx5e_xmit
 3.22%  [kernel]   [k] ip_rcv
 3.07%  [kernel]   [k] __netif_receive_skb_core
 2.86%  [kernel]   [k] 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-09-09 Thread Paweł Staszewski

Hi


Are there any plans to have this fix included in the kernel properly?

Or is it mostly just a hack, not a long-term fix, and needs to be done differently?


All the tests that were done show that without this patch there is about 
20-30% network forwarding performance degradation when using vlan 
interfaces.



Thanks
Paweł



On 2017-08-15 at 03:17, Eric Dumazet wrote:

On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:


Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.

Something like :

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 
5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct 
net_device *dev,
vlan->vlan_proto = proto;
vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
vlan->real_dev= real_dev;
+   dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
vlan->flags   = VLAN_FLAG_REORDER_HDR;
  
  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);









Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Jesper Dangaard Brouer

On Tue, 15 Aug 2017 12:05:37 +0200 Paweł Staszewski  
wrote:
> On 2017-08-15 at 12:02, Paweł Staszewski wrote:
> > On 2017-08-15 at 11:57, Jesper Dangaard Brouer wrote:
> >> On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski 
> >>  wrote:
> >>> On 2017-08-15 at 11:23, Jesper Dangaard Brouer wrote:
>  On Tue, 15 Aug 2017 02:38:56 +0200
>  Paweł Staszewski  wrote:  
> > On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:
> >> On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski 
> >>  wrote:  
[... cut ...]

> >>> Ethtool(enp175s0f1) stat:  8895566 (  8,895,566) <= 
> >>> tx_prio0_packets /sec
> >>> Ethtool(enp175s0f1) stat:640470657 (640,470,657) <= 
> >>> tx_vport_unicast_bytes /sec
> >>> Ethtool(enp175s0f1) stat:  8895427 (  8,895,427) <= 
> >>> tx_vport_unicast_packets /sec
> >>> Ethtool(enp175s0f1) stat:  498 (498) <= tx_xmit_more 
> >>> /sec  
> >>
> >> We are seeing some xmit_more, this is interesting.  Have you noticed,
> >> if (in the VLAN case) there is a queue in the qdisc layer?
> >>
> >> Simply inspect with: tc -s qdisc show dev ixgbe2  
[...]
> > physical interface mq attached with pfifo_fast:
> >
> > tc -s -d qdisc show dev enp175s0f1
> > qdisc mq 0: root
> >  Sent 1397200697212 bytes 3965888669 pkt (dropped 78065663, overlimits 0 
> > requeues 629868)
> >  backlog 0b 0p requeues 629868
> > qdisc pfifo_fast 0: parent :38 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
> > 1 1
> >  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >  backlog 0b 0p requeues 0
> > qdisc pfifo_fast 0: parent :37 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
> > 1 1
> >  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >  backlog 0b 0p requeues 0
[...]

So, it doesn't look like there is any backlog queue.  Although, this
can be difficult to measure/see this way (as the kernel empties the queue
quickly via bulk dequeue), also given the small amount of xmit_more, which
indicates that the queue was likely very small.

There is a "dropped" counter, which indicates that you likely had a
setup (earlier) where you managed to overflow the qdisc queues. 

> just saw that after changing RSS on the NICs I didn't delete and re-add 
> the qdisc:
> Here is the situation with qdisc del / add
> tc -s -d qdisc show dev enp175s0f1
> qdisc mq 1: root
>   Sent 43738523966 bytes 683414438 pkt (dropped 0, overlimits 0 requeues 1886)
>   backlog 0b 0p requeues 1886
> qdisc pfifo_fast 0: parent 1:10 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 
> 1 1
>   Sent 2585011904 bytes 40390811 pkt (dropped 0, overlimits 0 requeues 110)
>   backlog 0b 0p requeues 110
> qdisc pfifo_fast 0: parent 1:f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 
> 1
>   Sent 2602068416 bytes 40657319 pkt (dropped 0, overlimits 0 requeues 121)
>   backlog 0b 0p requeues 121
[...]

Exactly as you indicated above, these "dropped" stats came from another
(earlier) test case. (Great that you caught this yourself)

While trying to reproduce your case, I also managed to cause a situation
with qdisc overload.  This caused some weird behavior, where I saw
RX=8Mpps and TX only 4Mpps.  (I didn't figure out the exact tuning that
caused this, and cannot reproduce it now).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Paweł Staszewski



On 2017-08-15 at 12:02, Paweł Staszewski wrote:



On 2017-08-15 at 11:57, Jesper Dangaard Brouer wrote:
On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski 
 wrote:



On 2017-08-15 at 11:23, Jesper Dangaard Brouer wrote:

On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski  wrote:

On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:
On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski 
 wrote:

To show some difference, below is a comparison of vlan/no-vlan traffic

10 Mpps forwarded traffic with no-vlan vs 6.9 Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe).  I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface.  This is larger than I expected, but still lower than the
30-40% slowdown you reported.

[...]

OK, the Mellanox arrived (MT27700 - mlx5 driver)
And to compare Mellanox with vlans and without: 33% performance
degradation (less than with ixgbe, where I reached ~40% with the same
settings)


Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864

I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).  The Mellanox drivers
fortunately have a stats counter to tell us this explicitly (which the
ixgbe driver doesn't).

You can use my ethtool_stats.pl script watch these stats:
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency:  dnf install perl-Time-HiRes)

For RX NIC:
Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= 
rx0_bytes /sec
Ethtool(enp175s0f0) stat:   230978 (230,978) <= 
rx0_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= 
rx0_csum_complete /sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= 
rx0_packets /sec
Ethtool(enp175s0f0) stat:   921614 (921,614) <= 
rx0_page_reuse /sec
Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= 
rx1_bytes /sec
Ethtool(enp175s0f0) stat:   233343 (233,343) <= 
rx1_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= 
rx1_csum_complete /sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= 
rx1_packets /sec
Ethtool(enp175s0f0) stat:   927793 (927,793) <= 
rx1_page_reuse /sec
Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= 
rx2_bytes /sec
Ethtool(enp175s0f0) stat:   233735 (233,735) <= 
rx2_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= 
rx2_csum_complete /sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= 
rx2_packets /sec
Ethtool(enp175s0f0) stat:   937989 (937,989) <= 
rx2_page_reuse /sec
Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= 
rx3_bytes /sec
Ethtool(enp175s0f0) stat:   230311 (230,311) <= 
rx3_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= 
rx3_csum_complete /sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= 
rx3_packets /sec
Ethtool(enp175s0f0) stat:   922513 (922,513) <= 
rx3_page_reuse /sec
Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= 
rx4_bytes /sec
Ethtool(enp175s0f0) stat:   191969 (191,969) <= 
rx4_cache_reuse /sec
Ethtool(enp175s0f0) stat:   958317 (958,317) <= 
rx4_csum_complete /sec
Ethtool(enp175s0f0) stat:   958317 (958,317) <= 
rx4_packets /sec
Ethtool(enp175s0f0) stat:   766332 (766,332) <= 
rx4_page_reuse /sec
Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= 
rx5_bytes /sec
Ethtool(enp175s0f0) stat:   197150 (197,150) <= 
rx5_cache_reuse /sec
Ethtool(enp175s0f0) stat:   984128 (984,128) <= 
rx5_csum_complete /sec
Ethtool(enp175s0f0) stat:   984128 (984,128) <= 
rx5_packets /sec
Ethtool(enp175s0f0) stat:   786978 (786,978) <= 
rx5_page_reuse 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Jesper Dangaard Brouer

On Tue, 15 Aug 2017 11:11:57 +0200 Paweł Staszewski  
wrote:

> Yes it helped - now there is almost no difference when using vlans or not:
> 
> 10.5 Mpps - with vlan
> 
> 11 Mpps - without vlan

Great! - it seems like we have pinpointed the root-cause.  It also
demonstrates how big the benefit of Eric's commit is (thanks!):
 https://git.kernel.org/torvalds/c/93f154b594fe


> On 2017-08-15 at 03:17, Eric Dumazet wrote:
> > On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:
> >  
> >> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.  
> > Something like :
> >
> > diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> > index 
> > 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7
> >  100644
> > --- a/net/8021q/vlan_netlink.c
> > +++ b/net/8021q/vlan_netlink.c
> > @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct 
> > net_device *dev,
> > vlan->vlan_proto = proto;
> > vlan->vlan_id= nla_get_u16(data[IFLA_VLAN_ID]);
> > vlan->real_dev   = real_dev;
> > +   dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
> > vlan->flags  = VLAN_FLAG_REORDER_HDR;
> >   
> > err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Paweł Staszewski



On 2017-08-15 at 11:57, Jesper Dangaard Brouer wrote:

On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski  
wrote:


On 2017-08-15 at 11:23, Jesper Dangaard Brouer wrote:

On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski  wrote:
  

On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski  
wrote:
 

To show some difference, below is a comparison of vlan/no-vlan traffic

10 Mpps forwarded traffic with no-vlan vs 6.9 Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe).  I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface.  This is larger than I expected, but still lower than the
30-40% slowdown you reported.

[...]

OK, the Mellanox arrived (MT27700 - mlx5 driver)
And to compare Mellanox with vlans and without: 33% performance
degradation (less than with ixgbe, where I reached ~40% with the same settings)

Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864

I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).  The Mellanox drivers
fortunately have a stats counter to tell us this explicitly (which the
ixgbe driver doesn't).

You can use my ethtool_stats.pl script watch these stats:
   
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency:  dnf install perl-Time-HiRes)

For RX NIC:
Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
Ethtool(enp175s0f0) stat:   230978 (230,978) <= rx0_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_csum_complete 
/sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_packets /sec
Ethtool(enp175s0f0) stat:   921614 (921,614) <= rx0_page_reuse /sec
Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
Ethtool(enp175s0f0) stat:   233343 (233,343) <= rx1_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_csum_complete 
/sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_packets /sec
Ethtool(enp175s0f0) stat:   927793 (927,793) <= rx1_page_reuse /sec
Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
Ethtool(enp175s0f0) stat:   233735 (233,735) <= rx2_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_csum_complete 
/sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_packets /sec
Ethtool(enp175s0f0) stat:   937989 (937,989) <= rx2_page_reuse /sec
Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
Ethtool(enp175s0f0) stat:   230311 (230,311) <= rx3_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_csum_complete 
/sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_packets /sec
Ethtool(enp175s0f0) stat:   922513 (922,513) <= rx3_page_reuse /sec
Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
Ethtool(enp175s0f0) stat:   191969 (191,969) <= rx4_cache_reuse /sec
Ethtool(enp175s0f0) stat:   958317 (958,317) <= rx4_csum_complete 
/sec
Ethtool(enp175s0f0) stat:   958317 (958,317) <= rx4_packets /sec
Ethtool(enp175s0f0) stat:   766332 (766,332) <= rx4_page_reuse /sec
Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
Ethtool(enp175s0f0) stat:   197150 (197,150) <= rx5_cache_reuse /sec
Ethtool(enp175s0f0) stat:   984128 (984,128) <= rx5_csum_complete 
/sec
Ethtool(enp175s0f0) stat:   984128 (984,128) <= rx5_packets /sec
Ethtool(enp175s0f0) stat:   786978 (786,978) <= rx5_page_reuse /sec
Ethtool(enp175s0f0) stat: 79076984 ( 79,076,984) <= 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Jesper Dangaard Brouer

On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski  
wrote:

> On 2017-08-15 at 11:23, Jesper Dangaard Brouer wrote:
> > On Tue, 15 Aug 2017 02:38:56 +0200
> > Paweł Staszewski  wrote:
> >  
> >> On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:
> >>> On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski 
> >>>  wrote:
> >>> 
>  To show some difference, below is a comparison of vlan/no-vlan traffic
> 
>  10 Mpps forwarded traffic with no-vlan vs 6.9 Mpps with vlan  
> >>> I'm trying to reproduce in my testlab (with ixgbe).  I do see a
> >>> performance reduction of about 10-19% when I forward out a VLAN
> >>> interface.  This is larger than I expected, but still lower than the
> >>> 30-40% slowdown you reported.
> >>>
> >>> [...]  
> >> OK, the Mellanox arrived (MT27700 - mlx5 driver)
> >> And to compare Mellanox with vlans and without: 33% performance
> >> degradation (less than with ixgbe, where I reached ~40% with the same settings)
> >>
> >> Mellanox without TX traffic on vlan:
> >> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> >> 0;16;64;11089305;709715520;8871553;567779392
> >> 1;16;64;11096292;710162688;11095566;710116224
> >> 2;16;64;11095770;710129280;11096799;710195136
> >> 3;16;64;11097199;710220736;11097702;710252928
> >> 4;16;64;11080984;567081856;11079662;709098368
> >> 5;16;64;11077696;708972544;11077039;708930496
> >> 6;16;64;11082991;709311424;8864802;567347328
> >> 7;16;64;11089596;709734144;8870927;709789184
> >> 8;16;64;11094043;710018752;11095391;710105024
> >>
> >> Mellanox with TX traffic on vlan:
> >> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> >> 0;16;64;7369914;471674496;7370281;471697980
> >> 1;16;64;7368896;471609408;7368043;471554752
> >> 2;16;64;7367577;471524864;7367759;471536576
> >> 3;16;64;7368744;377305344;7369391;471641024
> >> 4;16;64;7366824;471476736;7364330;471237120
> >> 5;16;64;7368352;471574528;7367239;471503296
> >> 6;16;64;7367459;471517376;7367806;471539584
> >> 7;16;64;7367190;471500160;7367988;471551232
> >> 8;16;64;7368023;471553472;7368076;471556864  
> > I wonder if the driver's page recycler is active/working or not, and if
> > the situation is different between VLAN vs no-vlan (given
> > page_frag_free is so high in your perf top).  The Mellanox drivers
> > fortunately have a stats counter to tell us this explicitly (which the
> > ixgbe driver doesn't).
> >
> > You can use my ethtool_stats.pl script watch these stats:
> >   
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> > (Hint perl dependency:  dnf install perl-Time-HiRes)  
> For RX NIC:
> Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
> Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
> Ethtool(enp175s0f0) stat:   230978 (230,978) <= rx0_cache_reuse 
> /sec
> Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_csum_complete 
> /sec
> Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_packets /sec
> Ethtool(enp175s0f0) stat:   921614 (921,614) <= rx0_page_reuse 
> /sec
> Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
> Ethtool(enp175s0f0) stat:   233343 (233,343) <= rx1_cache_reuse 
> /sec
> Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_csum_complete 
> /sec
> Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_packets /sec
> Ethtool(enp175s0f0) stat:   927793 (927,793) <= rx1_page_reuse 
> /sec
> Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
> Ethtool(enp175s0f0) stat:   233735 (233,735) <= rx2_cache_reuse 
> /sec
> Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_csum_complete 
> /sec
> Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_packets /sec
> Ethtool(enp175s0f0) stat:   937989 (937,989) <= rx2_page_reuse 
> /sec
> Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
> Ethtool(enp175s0f0) stat:   230311 (230,311) <= rx3_cache_reuse 
> /sec
> Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_csum_complete 
> /sec
> Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_packets /sec
> Ethtool(enp175s0f0) stat:   922513 (922,513) <= rx3_page_reuse 
> /sec
> Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
> Ethtool(enp175s0f0) stat:   191969 (191,969) <= rx4_cache_reuse 
> /sec
> Ethtool(enp175s0f0) stat:   958317 (958,317) <= rx4_csum_complete 
> /sec
> Ethtool(enp175s0f0) stat:   958317 (958,317) <= rx4_packets /sec
> Ethtool(enp175s0f0) stat:   766332 (766,332) <= rx4_page_reuse 
> /sec
> Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
> Ethtool(enp175s0f0) stat:   197150 (197,150) <= rx5_cache_reuse 
> /sec
> 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Jesper Dangaard Brouer
On Mon, 14 Aug 2017 18:57:50 +0200
Paolo Abeni  wrote:

> On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> > The output (extracted below) didn't show who called 'do_raw_spin_lock',
> > BUT it showed another interesting thing.  The kernel code
> > __dev_queue_xmit() in might create route dst-cache problem for itself(?),
> > as it will first call skb_dst_force() and then skb_dst_drop() when the
> > packet is transmitted on a VLAN.
> > 
> >  static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >  {
> >  [...]
> > /* If device/qdisc don't need skb->dst, release it right now while
> >  * its hot in this cpu cache.
> >  */
> > if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> > skb_dst_drop(skb);
> > else
> > skb_dst_force(skb);  
> 
> I think that the high impact of the above code in this specific test is
> mostly due to the following:
> 
> - ingress packets with different RSS rx hash land on different CPUs
> - but they use the same dst entry, since the destination IPs belong to
> the same subnet
> - the dst refcnt cacheline is contended between all the CPUs

Good point and explanation Paolo :-)
I changed my pktgen setup to be closer to Pawel's to provoke this
situation some more, and I get closer to provoking this, although not as
clearly as Pawel.

A perf diff does show that the overhead in the VLAN case originates
from the routing "dst_release" code.  Diff Baseline == non-vlan case.

[jbrouer@canyon ~]$ sudo ~/perf diff
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  ..............................
#
 3.23% +4.32%  [kernel.vmlinux]  [k] __dev_queue_xmit
   +3.43%  [kernel.vmlinux]  [k] dst_release
13.54% -3.17%  [kernel.vmlinux]  [k] fib_table_lookup
 9.33% -2.73%  [kernel.vmlinux]  [k] _raw_spin_lock
 7.91% -1.75%  [ixgbe]   [k] ixgbe_poll
   +1.64%  [8021q]   [k] vlan_dev_hard_start_xmit
 7.23% -1.26%  [ixgbe]   [k] ixgbe_xmit_frame_ring
 3.34% -1.10%  [kernel.vmlinux]  [k] eth_type_trans
 5.20% +0.97%  [kernel.vmlinux]  [k] ip_route_input_rcu
 1.13% +0.95%  [kernel.vmlinux]  [k] ip_rcv_finish
 2.49% -0.82%  [kernel.vmlinux]  [k] ip_forward
 3.05% -0.80%  [kernel.vmlinux]  [k] __build_skb
 0.44% +0.74%  [kernel.vmlinux]  [k] __netif_receive_skb
   +0.71%  [kernel.vmlinux]  [k] neigh_connected_output
 1.70% +0.68%  [kernel.vmlinux]  [k] validate_xmit_skb
 1.42% +0.67%  [kernel.vmlinux]  [k] dev_hard_start_xmit
 0.49% +0.66%  [kernel.vmlinux]  [k] netif_receive_skb_internal
   +0.62%  [kernel.vmlinux]  [k] eth_header
   +0.57%  [ixgbe]   [k] ixgbe_tx_ctxtdesc
 1.19% -0.55%  [kernel.vmlinux]  [k] __netdev_pick_tx
 2.54% -0.48%  [kernel.vmlinux]  [k] fib_validate_source
 2.83% +0.46%  [kernel.vmlinux]  [k] ip_finish_output2
 1.45% +0.45%  [kernel.vmlinux]  [k] netif_skb_features
 1.66% -0.45%  [kernel.vmlinux]  [k] napi_gro_receive
 0.90% -0.40%  [kernel.vmlinux]  [k] validate_xmit_skb_list
 1.45% -0.39%  [kernel.vmlinux]  [k] ip_finish_output
   +0.36%  [8021q]   [k] vlan_passthru_hard_header
 1.28% -0.33%  [kernel.vmlinux]  [k] netdev_pick_tx
 

> Perhaps we can improve the situation setting the IFF_XMIT_DST_RELEASE
> flag for vlan if the underlying device does not have (relevant)
> classifier attached? (and clearing it as needed)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
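
To see the effect Paolo describes outside the kernel, here is a small stand-alone pthread demo (made-up names, not kernel code): each thread does paired increment/decrement on one shared atomic counter - the analogue of many CPUs calling skb_dst_force()/dst_release() on the same cached dst - and then on its own padded counter. On a multi-core machine the shared variant is usually several times slower; that cacheline bouncing is what shows up as dst_release/skb_dst_force in the profiles above. Build with: gcc -O2 -pthread refcnt_demo.c -o refcnt_demo

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    5000000L

static atomic_long shared_refcnt;
static struct { atomic_long cnt; char pad[56]; } percpu[NTHREADS]; /* padded to avoid false sharing */

static void *shared_worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++) {
		atomic_fetch_add(&shared_refcnt, 1);   /* skb_dst_force()-like */
		atomic_fetch_sub(&shared_refcnt, 1);   /* dst_release()-like */
	}
	return NULL;
}

static void *percpu_worker(void *arg)
{
	atomic_long *cnt = arg;
	for (long i = 0; i < ITERS; i++) {
		atomic_fetch_add(cnt, 1);
		atomic_fetch_sub(cnt, 1);
	}
	return NULL;
}

static double run(void *(*fn)(void *), int use_percpu)
{
	pthread_t tid[NTHREADS];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, fn,
			       use_percpu ? (void *)&percpu[i].cnt : NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	printf("shared refcnt : %.2f s\n", run(shared_worker, 0));
	printf("per-cpu refcnt: %.2f s\n", run(percpu_worker, 1));
	return 0;
}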


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Paweł Staszewski



On 2017-08-15 at 11:23, Jesper Dangaard Brouer wrote:

On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski  wrote:


On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski  
wrote:
  

To show some difference, below is a comparison of vlan/no-vlan traffic

10 Mpps forwarded traffic with no-vlan vs 6.9 Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe).  I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface.  This is larger than I expected, but still lower than the
30-40% slowdown you reported.

[...]

OK, the Mellanox arrived (MT27700 - mlx5 driver)
And to compare Mellanox with vlans and without: 33% performance
degradation (less than with ixgbe, where I reached ~40% with the same settings)

Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864

I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).  The Mellanox drivers
fortunately have a stats counter to tell us this explicitly (which the
ixgbe driver doesn't).

You can use my ethtool_stats.pl script watch these stats:
  
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency:  dnf install perl-Time-HiRes)

For RX NIC:
Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
Ethtool(enp175s0f0) stat: 78380071 ( 78,380,071) <= rx0_bytes /sec
Ethtool(enp175s0f0) stat:   230978 (230,978) <= 
rx0_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= 
rx0_csum_complete /sec

Ethtool(enp175s0f0) stat:  1152648 (  1,152,648) <= rx0_packets /sec
Ethtool(enp175s0f0) stat:   921614 (921,614) <= 
rx0_page_reuse /sec

Ethtool(enp175s0f0) stat: 78956591 ( 78,956,591) <= rx1_bytes /sec
Ethtool(enp175s0f0) stat:   233343 (233,343) <= 
rx1_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= 
rx1_csum_complete /sec

Ethtool(enp175s0f0) stat:  1161126 (  1,161,126) <= rx1_packets /sec
Ethtool(enp175s0f0) stat:   927793 (927,793) <= 
rx1_page_reuse /sec

Ethtool(enp175s0f0) stat: 79677124 ( 79,677,124) <= rx2_bytes /sec
Ethtool(enp175s0f0) stat:   233735 (233,735) <= 
rx2_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= 
rx2_csum_complete /sec

Ethtool(enp175s0f0) stat:  1171722 (  1,171,722) <= rx2_packets /sec
Ethtool(enp175s0f0) stat:   937989 (937,989) <= 
rx2_page_reuse /sec

Ethtool(enp175s0f0) stat: 78392893 ( 78,392,893) <= rx3_bytes /sec
Ethtool(enp175s0f0) stat:   230311 (230,311) <= 
rx3_cache_reuse /sec
Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= 
rx3_csum_complete /sec

Ethtool(enp175s0f0) stat:  1152837 (  1,152,837) <= rx3_packets /sec
Ethtool(enp175s0f0) stat:   922513 (922,513) <= 
rx3_page_reuse /sec

Ethtool(enp175s0f0) stat: 65165583 ( 65,165,583) <= rx4_bytes /sec
Ethtool(enp175s0f0) stat:   191969 (191,969) <= 
rx4_cache_reuse /sec
Ethtool(enp175s0f0) stat:   958317 (958,317) <= 
rx4_csum_complete /sec

Ethtool(enp175s0f0) stat:   958317 (958,317) <= rx4_packets /sec
Ethtool(enp175s0f0) stat:   766332 (766,332) <= 
rx4_page_reuse /sec

Ethtool(enp175s0f0) stat: 66920721 ( 66,920,721) <= rx5_bytes /sec
Ethtool(enp175s0f0) stat:   197150 (197,150) <= 
rx5_cache_reuse /sec
Ethtool(enp175s0f0) stat:   984128 (984,128) <= 
rx5_csum_complete /sec

Ethtool(enp175s0f0) stat:   984128 (984,128) <= rx5_packets /sec
Ethtool(enp175s0f0) stat:   786978 (786,978) <= 
rx5_page_reuse /sec

Ethtool(enp175s0f0) stat: 79076984 ( 79,076,984) <= rx6_bytes /sec
Ethtool(enp175s0f0) stat:   233735 (233,735) <= 
rx6_cache_reuse /sec
Ethtool(enp175s0f0) stat:  

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Jesper Dangaard Brouer
On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski  wrote:

> On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:
> > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski  
> > wrote:
> >  
> >> To show some difference, below is a comparison of vlan/no-vlan traffic
> >>
> >> 10 Mpps forwarded traffic with no-vlan vs 6.9 Mpps with vlan  
> > I'm trying to reproduce in my testlab (with ixgbe).  I do see a
> > performance reduction of about 10-19% when I forward out a VLAN
> > interface.  This is larger than I expected, but still lower than the
> > 30-40% slowdown you reported.
> >
> > [...]  
> OK, the Mellanox arrived (MT27700 - mlx5 driver)
> And to compare Mellanox with vlans and without: 33% performance 
> degradation (less than with ixgbe, where I reached ~40% with the same settings)
> 
> Mellanox without TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;11089305;709715520;8871553;567779392
> 1;16;64;11096292;710162688;11095566;710116224
> 2;16;64;11095770;710129280;11096799;710195136
> 3;16;64;11097199;710220736;11097702;710252928
> 4;16;64;11080984;567081856;11079662;709098368
> 5;16;64;11077696;708972544;11077039;708930496
> 6;16;64;11082991;709311424;8864802;567347328
> 7;16;64;11089596;709734144;8870927;709789184
> 8;16;64;11094043;710018752;11095391;710105024
> 
> Mellanox with TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;7369914;471674496;7370281;471697980
> 1;16;64;7368896;471609408;7368043;471554752
> 2;16;64;7367577;471524864;7367759;471536576
> 3;16;64;7368744;377305344;7369391;471641024
> 4;16;64;7366824;471476736;7364330;471237120
> 5;16;64;7368352;471574528;7367239;471503296
> 6;16;64;7367459;471517376;7367806;471539584
> 7;16;64;7367190;471500160;7367988;471551232
> 8;16;64;7368023;471553472;7368076;471556864

I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).  The Mellanox drivers
fortunately have a stats counter to tell us this explicitly (which the
ixgbe driver doesn't).

You can use my ethtool_stats.pl script watch these stats:
 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency:  dnf install perl-Time-HiRes)


> ethtool settings for both tests:
> ifc='enp175s0f0 enp175s0f1'
> for i in $ifc
>  do
>  ip link set up dev $i
>  ethtool -A $i autoneg off rx off tx off
>  ethtool -G $i rx 128 tx 256

The ring queue size recommendations might be different for the mlx5
driver (Cc'ing Mellanox maintainers).  


>  ip link set $i txqueuelen 1000
>  ethtool -C $i rx-usecs 25
>  ethtool -L $i combined 16
>  ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off 
> tx-nocache-copy off ntuple on
>  ethtool -N $i rx-flow-hash udp4 sdfn
>  done

Thanks for being explicit about what your setup is :-)
 
> and perf top:
> PerfTop:   83650 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz 
> cycles],  (all, 56 CPUs)
> ---
> 
>  14.25%  [kernel]   [k] dst_release
>  14.17%  [kernel]   [k] skb_dst_force
>  13.41%  [kernel]   [k] rt_cache_valid
>  11.47%  [kernel]   [k] ip_finish_output2
>   7.01%  [kernel]   [k] do_raw_spin_lock
>   5.07%  [kernel]   [k] page_frag_free
>   3.47%  [mlx5_core][k] mlx5e_xmit
>   2.88%  [kernel]   [k] fib_table_lookup
>   2.43%  [mlx5_core][k] skb_from_cqe.isra.32
>   1.97%  [kernel]   [k] virt_to_head_page
>   1.81%  [mlx5_core][k] mlx5e_poll_tx_cq
>   0.93%  [kernel]   [k] __dev_queue_xmit
>   0.87%  [kernel]   [k] __build_skb
>   0.84%  [kernel]   [k] ipt_do_table
>   0.79%  [kernel]   [k] ip_rcv
>   0.79%  [kernel]   [k] acpi_processor_ffh_cstate_enter
>   0.78%  [kernel]   [k] netif_skb_features
>   0.73%  [kernel]   [k] __netif_receive_skb_core
>   0.52%  [kernel]   [k] dev_hard_start_xmit
>   0.52%  [kernel]   [k] build_skb
>   0.51%  [kernel]   [k] ip_route_input_rcu
>   0.50%  [kernel]   [k] skb_unref
>   0.49%  [kernel]   [k] ip_forward
>   0.48%  [mlx5_core][k] mlx5_cqwq_get_cqe
>   0.44%  [kernel]   [k] udp_v4_early_demux
>   0.41%  [kernel]   [k] napi_consume_skb
>   0.40%  [kernel]   [k] __local_bh_enable_ip
>   0.39%  [kernel]   [k] ip_rcv_finish
>   0.39%  [kernel]   [k] kmem_cache_alloc
>   0.38%  [kernel]   [k] sch_direct_xmit
>   0.33%  [kernel]   [k] validate_xmit_skb
>   0.32%  [mlx5_core][k] mlx5e_free_rx_wqe_reuse
> 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Paweł Staszewski

With hack:

14.44%  [kernel]   [k] do_raw_spin_lock
 8.30%  [kernel]   [k] page_frag_free
 7.06%  [mlx5_core][k] mlx5e_xmit
 5.97%  [kernel]   [k] acpi_processor_ffh_cstate_enter
 5.73%  [kernel]   [k] fib_table_lookup
 4.81%  [mlx5_core][k] mlx5e_poll_tx_cq
 4.51%  [mlx5_core][k] skb_from_cqe.isra.32
 3.81%  [kernel]   [k] virt_to_head_page
 2.45%  [kernel]   [k] __dev_queue_xmit
 1.84%  [kernel]   [k] ipt_do_table
 1.77%  [kernel]   [k] napi_consume_skb
 1.62%  [kernel]   [k] __build_skb
 1.46%  [kernel]   [k] netif_skb_features
 1.43%  [kernel]   [k] __netif_receive_skb_core
 1.41%  [kernel]   [k] ip_rcv
 1.08%  [kernel]   [k] dev_hard_start_xmit
 1.02%  [kernel]   [k] build_skb
 1.00%  [mlx5_core][k] mlx5_cqwq_get_cqe
 0.96%  [kernel]   [k] ip_route_input_rcu
 0.95%  [kernel]   [k] ip_forward
 0.89%  [kernel]   [k] ip_finish_output2
 0.89%  [kernel]   [k] kmem_cache_alloc
 0.78%  [kernel]   [k] __local_bh_enable_ip
 0.76%  [kernel]   [k] udp_v4_early_demux
 0.75%  [kernel]   [k] compound_head
 0.75%  [kernel]   [k] __netdev_pick_tx
 0.73%  [kernel]   [k] sch_direct_xmit
 0.65%  [kernel]   [k] irq_entries_start
 0.63%  [mlx5_core][k] mlx5e_free_rx_wqe_reuse
 0.61%  [kernel]   [k] netdev_pick_tx
 0.61%  [kernel]   [k] validate_xmit_skb
 0.55%  [kernel]   [k] skb_network_protocol
 0.53%  [mlx5_core][k] mlx5e_rx_cache_get
 0.53%  [mlx5_core][k] mlx5e_build_rx_skb
 0.51%  [kernel]   [k] ip_rcv_finish
 0.50%  [kernel]   [k] eth_header
 0.50%  [kernel]   [k] fib_validate_source
 0.50%  [mlx5_core][k] mlx5e_handle_rx_cqe
 0.48%  [mlx5_core][k] eq_update_ci
 0.47%  [kernel]   [k] kmem_cache_free_bulk
 0.44%  [kernel]   [k] deliver_ptype_list_skb
 0.43%  [kernel]   [k] skb_release_data
 0.42%  [kernel]   [k] cpuidle_enter_state
 0.40%  [kernel]   [k] virt_to_head_page
 0.39%  [kernel]   [k] vlan_dev_hard_start_xmit
 0.39%  [kernel]   [k] neigh_connected_output
 0.38%  [kernel]   [k] eth_type_vlan
 0.35%  [mlx5_core][k] mlx5e_alloc_rx_wqe
 0.32%  [kernel]   [k] nf_hook_slow
 0.32%  [kernel]   [k] swiotlb_map_page
 0.31%  [kernel]   [k] ip_finish_output
 0.29%  [kernel]   [k] ip_output
 0.28%  [kernel]   [k] skb_free_head
 0.25%  [kernel]   [k] netif_receive_skb_internal
 0.25%  [kernel]   [k] __jhash_nwords



Without hack:

14.25%  [kernel]   [k] dst_release
14.17%  [kernel]   [k] skb_dst_force
13.41%  [kernel]   [k] rt_cache_valid
11.47%  [kernel]   [k] ip_finish_output2
 7.01%  [kernel]   [k] do_raw_spin_lock
 5.07%  [kernel]   [k] page_frag_free
 3.47%  [mlx5_core][k] mlx5e_xmit
 2.88%  [kernel]   [k] fib_table_lookup
 2.43%  [mlx5_core][k] skb_from_cqe.isra.32
 1.97%  [kernel]   [k] virt_to_head_page
 1.81%  [mlx5_core][k] mlx5e_poll_tx_cq
 0.93%  [kernel]   [k] __dev_queue_xmit
 0.87%  [kernel]   [k] __build_skb
 0.84%  [kernel]   [k] ipt_do_table
 0.79%  [kernel]   [k] ip_rcv
 0.79%  [kernel]   [k] acpi_processor_ffh_cstate_enter
 0.78%  [kernel]   [k] netif_skb_features
 0.73%  [kernel]   [k] __netif_receive_skb_core
 0.52%  [kernel]   [k] dev_hard_start_xmit
 0.52%  [kernel]   [k] build_skb
 0.51%  [kernel]   [k] ip_route_input_rcu
 0.50%  [kernel]   [k] skb_unref
 0.49%  [kernel]   [k] ip_forward
 0.48%  [mlx5_core][k] mlx5_cqwq_get_cqe
 0.44%  [kernel]   [k] udp_v4_early_demux
 0.41%  [kernel]   [k] napi_consume_skb
 0.40%  [kernel]   [k] __local_bh_enable_ip
 0.39%  [kernel]   [k] ip_rcv_finish
 0.39%  [kernel]   [k] kmem_cache_alloc
 0.38%  [kernel]   [k] sch_direct_xmit
 0.33%  [kernel]   [k] validate_xmit_skb
 0.32%  [mlx5_core][k] mlx5e_free_rx_wqe_reuse
 0.29%  [kernel]   [k] netdev_pick_tx
 0.28%  [mlx5_core][k] mlx5e_build_rx_skb
 0.27%  [kernel]   [k] deliver_ptype_list_skb
 0.26%  [kernel]   [k] fib_validate_source
 0.26%  [mlx5_core][k] mlx5e_napi_poll
 0.26%  [mlx5_core][k] mlx5e_handle_rx_cqe
 0.26%  [mlx5_core][k] mlx5e_rx_cache_get
 0.25%  [kernel]   [k] eth_header
 0.23%  [kernel]   [k] skb_network_protocol
 0.20%  [kernel]   [k] nf_hook_slow
 0.20%  [kernel]   [k] vlan_passthru_hard_header
 0.20%  [kernel]   [k] vlan_dev_hard_start_xmit
 0.19%  [kernel]   [k] swiotlb_map_page
 0.18%  [kernel]   [k] compound_head
 0.18%  [kernel]   [k] neigh_connected_output
 0.18%  [mlx5_core]

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-15 Thread Paweł Staszewski

Hi


Yes it helped - now there is almost no difference when using vlans or not:

10.5 Mpps - with vlan

11 Mpps - without vlan




On 2017-08-15 at 03:17, Eric Dumazet wrote:

On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:


Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.

Something like :

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->vlan_proto = proto;
 	vlan->vlan_id    = nla_get_u16(data[IFLA_VLAN_ID]);
 	vlan->real_dev   = real_dev;
+	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
 	vlan->flags      = VLAN_FLAG_REORDER_HDR;
 
 	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);









Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Eric Dumazet
On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:

> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.

Something like :

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->vlan_proto = proto;
 	vlan->vlan_id    = nla_get_u16(data[IFLA_VLAN_ID]);
 	vlan->real_dev   = real_dev;
+	dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
 	vlan->flags      = VLAN_FLAG_REORDER_HDR;
 
 	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);





Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Eric Dumazet
On Tue, 2017-08-15 at 02:45 +0200, Paweł Staszewski wrote:
> 
> > On 2017-08-14 at 18:57, Paolo Abeni wrote:
> > On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> >> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> >> BUT it showed another interesting thing.  The kernel code
> >> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> >> as it will first call skb_dst_force() and then skb_dst_drop() when the
> >> packet is transmitted on a VLAN.
> >>
> >>   static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >>   {
> >>   [...]
> >>/* If device/qdisc don't need skb->dst, release it right now while
> >> * its hot in this cpu cache.
> >> */
> >>if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> >>skb_dst_drop(skb);
> >>else
> >>skb_dst_force(skb);
> > I think that the high impact of the above code in this specific test is
> > mostly due to the following:
> >
> > - ingress packets with different RSS rx hash lands on different CPUs
> yes but isn't this normal ?
> everybody that wants to balance load over cores will try to use as many 
> as possible :)
> With some limitations  ... best are 6 to 7 RSS queues - so need to use 6 
> to 7 cpu cores
> 
> > - but they use the same dst entry, since the destination IPs belong to
> > the same subnet
> typical for ddos - many sources one destination

Nobody hit this issue yet.

We usually change the kernel, given typical workloads.

In this case, we might need per cpu nh_rth_input

Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.





Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Paweł Staszewski



On 2017-08-14 at 18:57, Paolo Abeni wrote:

On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:

The output (extracted below) didn't show who called 'do_raw_spin_lock',
BUT it showed another interesting thing.  The kernel code
__dev_queue_xmit() might create a route dst-cache problem for itself(?),
as it will first call skb_dst_force() and then skb_dst_drop() when the
packet is transmitted on a VLAN.

  static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
  {
  [...]
/* If device/qdisc don't need skb->dst, release it right now while
 * its hot in this cpu cache.
 */
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
else
skb_dst_force(skb);

I think that the high impact of the above code in this specific test is
mostly due to the following:

- ingress packets with different RSS rx hash lands on different CPUs

yes but isn't this normal ?
everybody that wants to balance load over cores will try to use as many 
as possible :)
With some limitations  ... best are 6 to 7 RSS queues - so need to use 6 
to 7 cpu cores



- but they use the same dst entry, since the destination IPs belong to
the same subnet

typical for ddos - many sources one destination



- the dst refcnt cacheline is contented between all the CPUs

Perhaps we can improve the situation by setting the IFF_XMIT_DST_RELEASE
flag for vlan if the underlying device does not have a (relevant)
classifier attached? (and clearing it as needed)

Paolo





Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Paweł Staszewski



On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski  
wrote:


To show some difference, below is a comparison of vlan/no-vlan traffic

10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe).  I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface.  This is larger than I expected, but still lower than what
you reported 30-40% slowdown.

[...]

Ok, Mellanox arrived (MT27700 - mlx5 driver)
And to compare Mellanox with vlans and without: 33% performance 
degradation (less than with ixgbe, where I reach ~40% with the same settings)


Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864



ethtool settings for both tests:
ifc='enp175s0f0 enp175s0f1'
for i in $ifc
do
ip link set up dev $i
ethtool -A $i autoneg off rx off tx off
ethtool -G $i rx 128 tx 256
ip link set $i txqueuelen 1000
ethtool -C $i rx-usecs 25
ethtool -L $i combined 16
ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off 
tx-nocache-copy off ntuple on

ethtool -N $i rx-flow-hash udp4 sdfn
done
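
For reference, the applied state can be read back per port with standard
ethtool queries - a minimal sketch, assuming the same interface list as above:

#!/bin/sh
# Sketch: read back the settings applied above (same interface list assumed).
ifc='enp175s0f0 enp175s0f1'
for i in $ifc
do
        echo "=== $i ==="
        ethtool -a $i                    # pause frame / flow control state
        ethtool -g $i                    # RX/TX ring sizes
        ethtool -c $i | grep 'rx-usecs:' # interrupt coalescing
        ethtool -l $i                    # channel (RSS queue) count
        ethtool -k $i | grep -E 'segmentation-offload|generic-receive-offload|ntuple'
        ethtool -n $i rx-flow-hash udp4  # current UDP4 hash fields
done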

and perf top:
   PerfTop:   83650 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz 
cycles],  (all, 56 CPUs)

---

14.25%  [kernel]   [k] dst_release
14.17%  [kernel]   [k] skb_dst_force
13.41%  [kernel]   [k] rt_cache_valid
11.47%  [kernel]   [k] ip_finish_output2
 7.01%  [kernel]   [k] do_raw_spin_lock
 5.07%  [kernel]   [k] page_frag_free
 3.47%  [mlx5_core][k] mlx5e_xmit
 2.88%  [kernel]   [k] fib_table_lookup
 2.43%  [mlx5_core][k] skb_from_cqe.isra.32
 1.97%  [kernel]   [k] virt_to_head_page
 1.81%  [mlx5_core][k] mlx5e_poll_tx_cq
 0.93%  [kernel]   [k] __dev_queue_xmit
 0.87%  [kernel]   [k] __build_skb
 0.84%  [kernel]   [k] ipt_do_table
 0.79%  [kernel]   [k] ip_rcv
 0.79%  [kernel]   [k] acpi_processor_ffh_cstate_enter
 0.78%  [kernel]   [k] netif_skb_features
 0.73%  [kernel]   [k] __netif_receive_skb_core
 0.52%  [kernel]   [k] dev_hard_start_xmit
 0.52%  [kernel]   [k] build_skb
 0.51%  [kernel]   [k] ip_route_input_rcu
 0.50%  [kernel]   [k] skb_unref
 0.49%  [kernel]   [k] ip_forward
 0.48%  [mlx5_core][k] mlx5_cqwq_get_cqe
 0.44%  [kernel]   [k] udp_v4_early_demux
 0.41%  [kernel]   [k] napi_consume_skb
 0.40%  [kernel]   [k] __local_bh_enable_ip
 0.39%  [kernel]   [k] ip_rcv_finish
 0.39%  [kernel]   [k] kmem_cache_alloc
 0.38%  [kernel]   [k] sch_direct_xmit
 0.33%  [kernel]   [k] validate_xmit_skb
 0.32%  [mlx5_core][k] mlx5e_free_rx_wqe_reuse
 0.29%  [kernel]   [k] netdev_pick_tx
 0.28%  [mlx5_core][k] mlx5e_build_rx_skb
 0.27%  [kernel]   [k] deliver_ptype_list_skb
 0.26%  [kernel]   [k] fib_validate_source
 0.26%  [mlx5_core][k] mlx5e_napi_poll
 0.26%  [mlx5_core][k] mlx5e_handle_rx_cqe
 0.26%  [mlx5_core][k] mlx5e_rx_cache_get
 0.25%  [kernel]   [k] eth_header
 0.23%  [kernel]   [k] skb_network_protocol
 0.20%  [kernel]   [k] nf_hook_slow
 0.20%  [kernel]   [k] vlan_passthru_hard_header
 0.20%  [kernel]   [k] vlan_dev_hard_start_xmit
 0.19%  [kernel]   [k] swiotlb_map_page
 0.18%  [kernel]   [k] compound_head
 0.18%  [kernel]   [k] neigh_connected_output
 0.18%  [mlx5_core][k] mlx5e_alloc_rx_wqe
 0.18%  [kernel]   [k] ip_output
 0.17%  [kernel]   [k] prefetch_freepointer.isra.70
 0.17%  [kernel]   [k] __slab_free
 0.16%  [kernel]   [k] eth_type_vlan
 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Paolo Abeni
On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> BUT it showed another interesting thing.  The kernel code
> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> as it will first call skb_dst_force() and then skb_dst_drop() when the
> packet is transmitted on a VLAN.
> 
>  static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>  {
>  [...]
>   /* If device/qdisc don't need skb->dst, release it right now while
>* its hot in this cpu cache.
>*/
>   if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
>   skb_dst_drop(skb);
>   else
>   skb_dst_force(skb);

I think that the high impact of the above code in this specific test is
mostly due to the following:

- ingress packets with different RSS rx hash lands on different CPUs
- but they use the same dst entry, since the destination IPs belong to
the same subnet
- the dst refcnt cacheline is contented between all the CPUs

Perhaps we can improve the situation by setting the IFF_XMIT_DST_RELEASE
flag for vlan if the underlying device does not have a (relevant)
classifier attached? (and clearing it as needed)

Paolo


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Eric Dumazet
On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:

> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> BUT it showed another interesting thing.  The kernel code
> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> as it will first call skb_dst_force() and then skb_dst_drop() when the
> packet is transmitted on a VLAN.
> 
>  static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>  {
>  [...]
>   /* If device/qdisc don't need skb->dst, release it right now while
>* its hot in this cpu cache.
>*/
>   if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
>   skb_dst_drop(skb);
>   else
>   skb_dst_force(skb);

This is explained in this commit changelog.

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=93f154b594fe47e4a7e5358b309add449a046cd3




Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Jesper Dangaard Brouer

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski  
wrote:

> To show some difference, below is a comparison of vlan/no-vlan traffic
> 
> 10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe).  I do see a
performance reduction of about 10-19% when I forward out a VLAN
interface.  This is larger than I expected, but still lower than what
you reported 30-40% slowdown.

[...]

> >>> perf top:
> >>>
> >>>PerfTop:   77835 irqs/sec  kernel:99.7%  
> >>> -
> >>>
> >>>   16.32%  [kernel]   [k] skb_dst_force
> >>>   16.30%  [kernel]   [k] dst_release
> >>>   15.11%  [kernel]   [k] rt_cache_valid
> >>>   12.62%  [kernel]   [k] ipv4_mtu  
> >> It seems a little strange that these 4 functions are on the top  

I don't see these in my test.

> >>  
> >>>5.60%  [kernel]   [k] do_raw_spin_lock  
> >> Who is calling/taking this lock? (Use perf call-graph recording.)
> > can be hard to paste it here:)
> > attached file

The attached file was very big. Please don't attach such big files on mailing
lists.  Next time please share them via e.g. pastebin. The output was a
capture from your terminal, which made the output more difficult to
read.  Hint: You can/could use perf --stdio and place it in a file
instead.

The output (extracted below) didn't show who called 'do_raw_spin_lock',
BUT it showed another interesting thing.  The kernel code
__dev_queue_xmit() might create a route dst-cache problem for itself(?),
as it will first call skb_dst_force() and then skb_dst_drop() when the
packet is transmitted on a VLAN.

 static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 {
 [...]
/* If device/qdisc don't need skb->dst, release it right now while
 * its hot in this cpu cache.
 */
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
else
skb_dst_force(skb);


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Extracted part of attached perf output:

 --5.37%--ip_rcv_finish
   |  
   |--4.02%--ip_forward
   |   |  
   |--3.92%--ip_forward_finish
   |   |  
   |--3.91%--ip_output
   |  |  
   |   --3.90%--ip_finish_output
   |  |  
   |   --3.88%--ip_finish_output2
   |  |  
   |   --2.77%--neigh_connected_output
   | |  
   |  --2.74%--dev_queue_xmit
   | |  
   |  --2.73%--__dev_queue_xmit
   | |  
   | |--1.66%--dev_hard_start_xmit
   | |   |  
   | |--1.64%--vlan_dev_hard_start_xmit
   | |   |  
   | |--1.63%--dev_queue_xmit
   | |   |  
   | |--1.62%--__dev_queue_xmit
   | |   |  
   | |   |--0.99%--skb_dst_drop.isra.77
   | |   |   |  
   | |   |   --0.99%--dst_release
   | |   |  
   | |--0.55%--sch_direct_xmit
   | |  
   |  --0.99%--skb_dst_force
   |  
--1.29%--ip_route_input_noref
|  
 --1.29%--ip_route_input_rcu
 |  
  --1.05%--rt_cache_valid


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Paweł Staszewski



On 2017-08-14 at 02:07, Alexander Duyck wrote:

On Sat, Aug 12, 2017 at 10:27 AM, Paweł Staszewski
 wrote:

Hi and thanks for reply



On 2017-08-12 at 14:23, Jesper Dangaard Brouer wrote:

On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski
 wrote:


Hi

I made some tests for performance comparison.

Thanks for doing this. Feel free to Cc me, if you do more of these
tests (so I don't miss them on the mailing list).

I don't understand if you are reporting a potential problem?

It would be good if you can provide a short summary section (of the
issue) in the _start_ of the email, and then provide all this nice data
afterwards, to back your case.

My understanding is, you report:

1. VLANs on ixgbe show a 30-40% slowdown
2. System stopped scaling after 7+ CPUs

So I had read through most of this before I realized what it was you
were reporting. As far as the behavior there are a few things going
on. I have some additional comments below but they are mostly based on
what I had read up to that point.

As far as possible issues for item 1: the VLAN adds 4 bytes of data to
the payload; when it is stripped it can result in a packet that is 56
bytes. These missing 8 bytes can cause issues as they force the CPU to
do a read/modify/write every time the device writes to the 64B cache
line instead of just doing it as a single write. This can be very
expensive and hurt performance. In addition it adds 4 bytes on the
wire, so if you are sending the same 64B packets over the VLAN
interface it is bumping them up to 68B to make room for the VLAN tag.
I am suspecting you are encountering one of these type of issues. You
might try tweaking the packet sizes in increments of 4 to see if there
is a sweet spot that you might be falling out of or into.

No, this is not a problem with the 4-byte header or so.

The topology is like this:

TX generator (pktgen) physical interface, no vlan -> RX physical 
interface (no vlan) [ FORWARDING HOST ] TX vlan interface bound to 
physical interface -> SINK


below data for packet size 70 (pktgen PKT_SIZE: 70)

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX

0;16;70;7246720;434749440;7245917;420269856
1;16;70;7249152;434872320;7248885;420434344
2;16;70;7249024;434926080;7249225;420401400
3;16;70;7249984;434952960;7249448;420435736
4;16;70;7251200;435064320;7250990;420495244
5;16;70;7241408;434592000;7241781;420068074
6;16;70;7229696;433689600;7229750;419268196
7;16;70;7236032;434127360;7236133;419669092
8;16;70;7236608;434161920;7236274;419695830
9;16;70;7226496;433578240;7227501;419107826

100% cpu load on all 16 cores

the difference vlan/no-vlan currently on this host varies from 40 to 
even 50% (but I can't check if it can reach 50% performance degradation because 
pktgen can give me only 10Mpps at 70% of cpu load on the forwarding host - 
so there is still room to forward, maybe at line rate 14Mpps)



Item 2 is a known issue with the NICs supported by ixgbe, at least for
anything 82599 and later. The issue here is that there isn't really an
Rx descriptor cache so to try and optimize performance the hardware
will try to write back as many descriptors as it has ready for the ring
requesting writeback. The problem is as you add more rings it means
the writes get smaller as they are triggering more often. So what you
end up seeing is that for each additional ring you add the performance
starts dropping as soon as the rings are no longer being fully
saturated. You can tell this has happened when the CPUs in use
suddenly all stop reporting 100% softirq use. So for example to
perform at line rate with 64B packets you would need something like
XDP and to keep the ring count small, like maybe 2 rings. Any more
than that and the performance will start to drop as you hit PCIe
bottlenecks.


This is not only a problem/bug report - but some kind of comparison plus
some thoughts about possible problems :)
And it can help somebody when searching the net for possible expectations :)
Also - I don't know a better list where the smartest people that know what is
going on in the kernel with networking are :)

Next time I will place the summary on top - sorry :)


Tested HW (FORWARDING HOST):

Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

Interesting, I've not heard about an Intel CPU called "Gold" before now,
but it does exist:

https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz



Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

This is one of my all time favorite NICs!

Yes this is a good NIC - will have connectx-4 2x100G by monday so will also
do some tests


Test diagram:


TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST
(enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK

Forwarder traffic: UDP random ports from 9 to 19 with random hosts from
172.16.0.1 to 172.16.0.255

TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)

What kind of traffic flow?  E.g. distribution, many/few source 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-13 Thread Alexander Duyck
On Sat, Aug 12, 2017 at 10:27 AM, Paweł Staszewski
 wrote:
> Hi and thanks for reply
>
>
>
>> On 2017-08-12 at 14:23, Jesper Dangaard Brouer wrote:
>>
>> On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski
>>  wrote:
>>
>>> Hi
>>>
>>> I made some tests for performance comparison.
>>
>> Thanks for doing this. Feel free to Cc me, if you do more of these
>> tests (so I don't miss them on the mailing list).
>>
>> I don't understand if you are reporting a potential problem?
>>
>> It would be good if you can provide a short summary section (of the
>> issue) in the _start_ of the email, and then provide all this nice data
>> afterwards, to back your case.
>>
>> My understanding is, you report:
>>
>> 1. VLANs on ixgbe show a 30-40% slowdown
>> 2. System stopped scaling after 7+ CPUs

So I had read through most of this before I realized what it was you
were reporting. As far as the behavior there are a few things going
on. I have some additional comments below but they are mostly based on
what I had read up to that point.

As far as possible issues for item 1: the VLAN adds 4 bytes of data to
the payload; when it is stripped it can result in a packet that is 56
bytes. These missing 8 bytes can cause issues as they force the CPU to
do a read/modify/write every time the device writes to the 64B cache
line instead of just doing it as a single write. This can be very
expensive and hurt performance. In addition it adds 4 bytes on the
wire, so if you are sending the same 64B packets over the VLAN
interface it is bumping them up to 68B to make room for the VLAN tag.
I am suspecting you are encountering one of these type of issues. You
might try tweaking the packet sizes in increments of 4 to see if there
is a sweet spot that you might be falling out of or into.
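
One way to run that size sweep with the in-kernel pktgen is via the raw
/proc/net/pktgen interface - only a sketch; the TX device name, pktgen thread
number and destination MAC below are placeholders to adjust:

#!/bin/sh
# Hypothetical packet-size sweep in 4-byte steps, as suggested above.
# Assumes pktgen is loaded (modprobe pktgen); DEV and DST_MAC are placeholders.
DEV=eth0
DST_MAC=00:11:22:33:44:55

pgset() { echo "$1" > "$2"; }

pgset "rem_device_all"    /proc/net/pktgen/kpktgend_0
pgset "add_device $DEV@0" /proc/net/pktgen/kpktgend_0

for size in $(seq 60 4 96)
do
        pgset "count 10000000"       /proc/net/pktgen/$DEV@0
        pgset "pkt_size $size"       /proc/net/pktgen/$DEV@0
        pgset "dst_min 172.16.0.1"   /proc/net/pktgen/$DEV@0
        pgset "dst_max 172.16.0.100" /proc/net/pktgen/$DEV@0
        pgset "dst_mac $DST_MAC"     /proc/net/pktgen/$DEV@0
        pgset "flag UDPSRC_RND"      /proc/net/pktgen/$DEV@0
        pgset "udp_src_min 9"        /proc/net/pktgen/$DEV@0
        pgset "udp_src_max 19"       /proc/net/pktgen/$DEV@0
        pgset "start" /proc/net/pktgen/pgctrl   # blocks until the run finishes
        echo "=== pkt_size $size ==="
        grep -E 'min_pkt_size|pps' /proc/net/pktgen/$DEV@0
done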

Item 2 is a known issue with the NICs supported by ixgbe, at least for
anything 82599 and later. The issue here is that there isn't really an
Rx descriptor cache so to try and optimize performance the hardware
will try to write back as many descriptors as it has ready for the ring
requesting writeback. The problem is as you add more rings it means
the writes get smaller as they are triggering more often. So what you
end up seeing is that for each additional ring you add the performance
starts dropping as soon as the rings are no longer being fully
saturated. You can tell this has happened when the CPUs in use
suddenly all stop reporting 100% softirq use. So for example to
perform at line rate with 64B packets you would need something like
XDP and to keep the ring count small, like maybe 2 rings. Any more
than that and the performance will start to drop as you hit PCIe
bottlenecks.
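
If someone wants to try that on this setup, a minimal sketch (interface names
and the set_irq_affinity.sh helper are reused from elsewhere in this thread;
the queue count of 2 and the CPU ranges are assumptions to adjust):

#!/bin/sh
# Sketch: shrink the RSS/ring count as suggested above and re-pin the IRQs.
for i in enp216s0f0 enp216s0f1
do
        ethtool -L $i combined 2   # only 2 RX/TX queue pairs per port
        ethtool -l $i              # confirm the new channel count
done
# re-pin the remaining queues to NUMA-local cores
./set_irq_affinity.sh -x 14-15 enp216s0f0
./set_irq_affinity.sh -x 16-17 enp216s0f1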

> This is not only a problem/bug report - but some kind of comparison plus
> some thoughts about possible problems :)
> And it can help somebody when searching the net for possible expectations :)
> Also - I don't know a better list where the smartest people that know what is
> going on in the kernel with networking are :)
>
> Next time I will place the summary on top - sorry :)
>
>>
>>> Tested HW (FORWARDING HOST):
>>>
>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>
>> Interesting, I've not heard about an Intel CPU called "Gold" before now,
>> but it does exist:
>>
>> https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz
>>
>>
>>> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
>>
>> This is one of my all time favorite NICs!
>
> Yes this is a good NIC - will have connectx-4 2x100G by monday so will also
> do some tests
>
>>
>>>
>>> Test diagram:
>>>
>>>
>>> TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST
>>> (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK
>>>
>>> Forwarder traffic: UDP random ports from 9 to 19 with random hosts from
>>> 172.16.0.1 to 172.16.0.255
>>>
>>> TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)
>>
>> What kind of traffic flow?  E.g. distribution, many/few source IPs...
>
>
> Traffic generator is pktgen so udp flows - better paste parameters from
> pktgen:
> UDP_MIN=9
> UDP_MAX=19
>
> pg_set $dev "dst_min 172.16.0.1"
> pg_set $dev "dst_max 172.16.0.100"
>
> # Setup random UDP port src range
> #pg_set $dev "flag UDPSRC_RND"
> pg_set $dev "flag UDPSRC_RND"
> pg_set $dev "udp_src_min $UDP_MIN"
> pg_set $dev "udp_src_max $UDP_MAX"
>
>
>>
>>
>>>
>>> Settings used for FORWARDING HOST (changed param. was only number of RSS
>>> combined queues + set affinity assignment for them to fit with first
>>> numa node where 2x10G port card is installed)
>>>
>>> ixgbe driver used from kernel (in-kernel build - not a module)
>>>
>> Nice with a script showing your setup, thanks. It would be good if it had
>> comments telling why you think this is a needed setup adjustment.
>>
>>> #!/bin/sh
>>> ifc='enp216s0f0 enp216s0f1'
>>> for i in $ifc
>>>   do
>>>   ip link set up dev $i
>>>   ethtool -A $i autoneg 

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-13 Thread Paweł Staszewski

To show some difference, below is a comparison of vlan/no-vlan traffic

10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

(ixgbe in kernel driver kernel 4.13.0-rc4-next-20170811)

ethtool settings for both tests:

ethtool -K $ifc gro off tso off gso off sg on l2-fwd-offload off 
tx-nocache-copy off ntuple off


ethtool -L $ifc combined 16

ethtool -C $ifc rx-usecs 2

ethtool -G $ifc rx 4096 tx 1024

16 CORES / 16 RSS QUEUES


Tx traffic on vlan:

RX Interface:

enp216s0f0

TX Interface

vlan1000 added to enp216s0f1 interface (with vlan 1000 ip address assigned)

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;6939008;416325120;6938696;402411192
1;16;64;6941952;416444160;6941745;402558918
2;16;64;6960576;417584640;6960707;403698718
3;16;64;6940736;416486400;6941820;402503876
4;16;64;6927680;415741440;6927420;401853870
5;16;64;6929792;415687680;6929917;401839196
6;16;64;6950400;416989440;6950661;403026166
7;16;64;6953664;417216000;6953454;403260544
8;16;64;6948480;416851200;6948800;403023266
9;16;64;6924160;415422720;6924092;401542468

100% load on all 16 Cores.


vs

RX interface from traffic generator:

enp216s0f0

TX interface to the sink:

enp216s0f1

No vlan used

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX

0;16;64;10280176;793608540;10298496;596796568
1;16;64;10046928;600978780;10046022;582527002
2;16;64;10032956;601827420;10026097;581515656
3;16;64;10051503;602252460;10067880;582420804
4;16;64;10016204;602725140;10017358;582644800
5;16;64;10035575;602437620;10059504;582067294
6;16;64;10041667;603069780;10057865;582477412
7;16;64;1008;600027420;10046526;581022018
8;16;64;10022436;601121100;10025946;581904314
9;16;64;10036231;602514960;10058724;582180684


So we have 10Mpps forwarded

- I have problems with pktgen on my traffic generator pushing more than 
10M, but this is low budget hardware so.. :)



And there are still free cpu cycles, so it can probably forward at the 10G 
line rate of 14Mpps
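
The per-CPU breakdown below looks like mpstat output; a comparable snapshot
can be collected with sysstat's mpstat, averaged over a short interval, e.g.:

# Sketch: per-CPU utilization averaged over one 5-second interval.
mpstat -P ALL 5 1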


Average:  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:  all    0.00    0.00    0.00    0.00    0.00   20.91    0.00    0.00    0.00   79.09
Average:    0    0.00    0.00    0.00    0.00    0.00    0.09    0.00    0.00    0.00   99.91
Average:    1    0.03    0.00    0.03    0.00    0.00    0.00    0.00    0.00    0.00   99.94
Average:    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:    5    0.00    0.00    0.18    0.00    0.00    0.00    0.00    0.00    0.00   99.82
Average:    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   10    0.00    0.00    0.03    0.24    0.00    0.00    0.00    0.00    0.00   99.74
Average:   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   14    0.00    0.00    0.00    0.00    0.00   92.38    0.00    0.00    0.00    7.62
Average:   15    0.00    0.00    0.00    0.00    0.00   85.88    0.00    0.00    0.00   14.12
Average:   16    0.00    0.00    0.00    0.00    0.00   64.91    0.00    0.00    0.00   35.09
Average:   17    0.00    0.00    0.00    0.00    0.00   66.76    0.00    0.00    0.00   33.24
Average:   18    0.00    0.00    0.00    0.00    0.00   65.57    0.00    0.00    0.00   34.43
Average:   19    0.00    0.00    0.00    0.00    0.00   66.38    0.00    0.00    0.00   33.62
Average:   20    0.00    0.00    0.00    0.00    0.00   72.97    0.00    0.00    0.00   27.03
Average:   21    0.00    0.00    0.00    0.00    0.00   70.80    0.00    0.00    0.00   29.20
Average:   22    0.00    0.00    0.00    0.00    0.00   66.44    0.00    0.00    0.00   33.56
Average:   23    0.00    0.00    0.00    0.00    0.00   66.12    0.00    0.00    0.00   33.88
Average:   24    0.00    0.00    0.00    0.00    0.00   68.35    0.00    0.00    0.00   31.65
Average:   25    0.00    0.00    0.00    0.00    0.00   71.79    0.00    0.00    0.00   28.21
Average:   26    0.00    0.00    0.00    0.00    0.00   70.24    0.00    0.00    0.00   29.76
Average:   27    0.00    0.00    0.00    0.00    0.00   73.24    0.00    0.00    0.00   26.76
Average:   28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   29    0.00    0.00    0.00    0.00    0.00    0.00

Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-12 Thread Jesper Dangaard Brouer

On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski  
wrote:

> Hi
> 
> I made some tests for performance comparison.

Thanks for doing this. Feel free to Cc me, if you do more of these
tests (so I don't miss them on the mailing list).

I don't understand if you are reporting a potential problem?

It would be good if you can provide a short summary section (of the
issue) in the _start_ of the email, and then provide all this nice data
afterwards, to back your case.

My understanding is, you report:

1. VLANs on ixgbe show a 30-40% slowdown
2. System stopped scaling after 7+ CPUs


> Tested HW (FORWARDING HOST):
> 
> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

Interesting, I've not heard about an Intel CPU called "Gold" before now,
but it does exist:
 
https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz


> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

This is one of my all time favorite NICs!
 
> Test diagram:
> 
> 
> TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST 
> (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK
> 
> Forwarder traffic: UDP random ports from 9 to 19 with random hosts from 
> 172.16.0.1 to 172.16.0.255
> 
> TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)

What kind of traffic flow?  E.g. distribution, many/few source IPs...

 
> Settings used for FORWARDING HOST (changed param. was only number of RSS 
> combined queues + set affinity assignment for them to fit with first 
> numa node where 2x10G port card is installed)
> 
> ixgbe driver used from kernel (in-kernel build - not a module)
> 

Nice with a script showing your setup, thanks. It would be good if it had
comments telling why you think this is a needed setup adjustment.

> #!/bin/sh
> ifc='enp216s0f0 enp216s0f1'
> for i in $ifc
>  do
>  ip link set up dev $i
>  ethtool -A $i autoneg off rx off tx off

Good:
 Turning off Ethernet flow control, to avoid receiver being the
 bottleneck via pause-frames.

>  ethtool -G $i rx 4096 tx 1024

You adjust the RX and TX ring queue sizes; this has effects that you
may not realize.  Especially for the ixgbe driver, which has a page
recycle trick tied to the RX ring queue size.

>  ip link set $i txqueuelen 1000

Setting tx queue len to the default 1000 seems redundant.

>  ethtool -C $i rx-usecs 10

Adjusting this also has effects you might not realize.  This actually
also affects the page recycle scheme of ixgbe.  And it can sometimes be
used to solve stalling on DMA TX completions, which could be your issue
here.


>  ethtool -L $i combined 16
>  ethtool -K $i gro on tso on gso off sg on l2-fwd-offload off 
> tx-nocache-copy on ntuple on

There are many settings above.

GRO/GSO/TSO for _forwarding_ is actually bad... in my tests, enabling
this results in an approx 10% slowdown.

AFAIK "tx-nocache-copy on" was also determined to be a bad option.

The "ntuple on" AFAIK disables the flow-director in the NIC.  I though
this would actually help VLAN traffic, but I guess not.
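
Collected in one place, the adjustments suggested above would look roughly
like this - a sketch only, reusing the interface loop from the original script:

#!/bin/sh
# Sketch: forwarding-friendly offload settings per the comments above.
ifc='enp216s0f0 enp216s0f1'
for i in $ifc
do
        ethtool -K $i gro off gso off tso off tx-nocache-copy off sg on
done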


>  ethtool -N $i rx-flow-hash udp4 sdfn

Why do you change the NIC's flow-hash?

>  done
> 
> ip link set up dev enp216s0f0
> ip link set up dev enp216s0f1
> 
> ip a a 10.0.0.1/30 dev enp216s0f0
> 
> ip link add link enp216s0f1 name vlan1000 type vlan id 1000
> ip link set up dev vlan1000
> ip a a 10.0.0.5/30 dev vlan1000
> 
> 
> ip route add 172.16.0.0/12 via 10.0.0.6
> 
> ./set_irq_affinity.sh -x 14-27,42-43 enp216s0f0
> ./set_irq_affinity.sh -x 14-27,42-43 enp216s0f1
> #cat  /sys/devices/system/node/node1/cpulist
> #14-27,42-55
> #cat  /sys/devices/system/node/node0/cpulist
> #0-13,28-41

Is this a NUMA system?
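
A quick way to answer that from the forwarding host itself - a sketch,
assuming the usual sysfs attributes are present for these PCI NICs:

#!/bin/sh
# Sketch: show which NUMA node each NIC sits on and the CPUs local to it.
for i in enp216s0f0 enp216s0f1
do
        echo "=== $i ==="
        cat /sys/class/net/$i/device/numa_node      # NUMA node of the NIC
        cat /sys/class/net/$i/device/local_cpulist  # CPUs local to that node
done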

 
> #
> 
> 
> Looks like forwarding performance when using vlans on ixgbe is less than 
> without vlans by about 30-40% (wondering if this is some vlan 
> offloading problem with ixgbe)

I would see this as a problem/bug that enabling VLANs costs this much.
 
> settings below:
> 
> ethtool -k enp216s0f0
> Features for enp216s0f0:
> Cannot get device udp-fragmentation-offload settings: Operation not 
> supported
> rx-checksumming: on
> tx-checksumming: on
>  tx-checksum-ipv4: off [fixed]
>  tx-checksum-ip-generic: on
>  tx-checksum-ipv6: off [fixed]
>  tx-checksum-fcoe-crc: off [fixed]
>  tx-checksum-sctp: on
> scatter-gather: on
>  tx-scatter-gather: on
>  tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
>  tx-tcp-segmentation: on
>  tx-tcp-ecn-segmentation: off [fixed]
>  tx-tcp-mangleid-segmentation: on
>  tx-tcp6-segmentation: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: off
> generic-receive-offload: on
> large-receive-offload: off
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: on
> receive-hashing: on
> highdma: on [fixed]