Hi Florin,

Thanks once again for looking at this issue. Please see inline:

On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Inline.
>
> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> Thanks for the response. Please see inline:
>
> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Hi Vijay,
>>
>> Cool experiment. More inline.
>>
>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine
>> connected to another server running VPP using 100G NICs. Both servers are
>> Intel Xeon with 24 cores.
>>
>> May I ask the frequency for those cores? Also what type of nic are you
>> using?
>>
>
> 2700 MHz.
>
>
> Probably this somewhat limits throughput per single connection compared to
> my testbed where the Intel cpu boosts to 4GHz.
>

Please see below; I noticed an anomaly.


> The nic is a Pensando DSC100.
>
>
> Okay, not sure what to expect there. Since this mostly stresses the rx
> side, what’s the number of rx descriptors? Typically I test with 256, with
> more connections higher throughput you might need more.
>

This is the default; the comments in the code suggest it is 1024. I don't
see any rx-queue-empty errors on the nic, which probably means there are
sufficient buffers.
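If I do end up tuning the descriptor count, my understanding is that the per-device dpdk stanza in startup.conf accepts something like the following (a sketch only; 256 is just the value you suggested, and I haven't verified it helps here):

```
dpdk {
  dev 0000:15:00.0 {
    name eth0
    num-rx-queues 4
    num-rx-desc 256   # default appears to be 1024
  }
}
```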

> > I am trying to push 100G traffic from the iperf Linux TCP client by
>> starting 10 parallel iperf connections on different port numbers and
>> pinning them to different cores on the sender side. On the VPP receiver
>> side I have 10 worker threads and 10 rx-queues in dpdk, and running iperf3
>> using VCL library as follows
>> >
>> > taskset 0x00400 sh -c
>> "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so
>> VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>> > taskset 0x00800 sh -c
>> "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so
>> VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>> > ...
>> >
>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>> >
>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>> > ...
>>
>> Could you try first with only 1 iperf server/client pair, just to see
>> where performance is with that?
>>
>
> These are the numbers I get
> rx-fifo-size 65536: ~8G
> rx-fifo-size 524288: 22G
> rx-fifo-size 4000000: 25G
>
>
> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector
> rate (and the errors) in this case also?
>

I noticed that adding "enable-tcp-udp-checksum" back seems to improve
performance. I am not sure if this is an issue with the dpdk driver for the
nic. In any case, in the "show hardware" flags I now see that tcp and udp
checksum offloads are enabled:

root@server:~# vppctl show hardware
              Name                Idx   Link  Hardware
eth0                               1     up   dsc1
  Link speed: 100 Gbps
  Ethernet address 00:ae:cd:03:79:51
  ### UNKNOWN ###
    carrier up full duplex mtu 9000
    flags: admin-up pmd maybe-multiseg rx-ip4-cksum
    Devargs:
    rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
    tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
    pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
    max rx packet len: 9208
    promiscuous: unicast off all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
                       jumbo-frame scatter
    rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
                       outer-ipv4-cksum multi-segs mbuf-fast-free
outer-udp-cksum
    tx offload active: multi-segs
    rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
    rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
    tx burst function: ionic_xmit_pkts
    rx burst function: ionic_recv_pkts

With this I get better performance per iperf3 connection - about 30.5G.
Show run output is attached (1connection.txt).



> rx-fifo-size 8000000: 25G
>
>
>>
>> >
>> > I see that I am not able to push beyond 50-60G. I tried different sizes
>> for the vcl rx-fifo-size - 64K, 256K and 1M. With 1M fifo size, I see that
>> tcp latency as reported on the client increases, but not a significant
>> improvement in bandwidth. Are there any suggestions to achieve 100G
>> bandwidth? I am using a vpp build from master.
>>
>> Depends a lot on how many connections you’re running in parallel. With
>> only one connection, buffer occupancy might go up, so 1-2MB might be better.
>>
>>
>
> With the current run I increased this to 8000000.
>
>>
>> Could you also check how busy vpp is with “clear run” wait at least 1
>> second and then “show run”. That will give you per node/worker vector
>> rates. If they go above 100 vectors/dispatch the workers are busy so you
>> could increase their number and implicitly the number of sessions. Note
>> however that RSS is not perfect so you can get more connections on one
>> worker.
>>
>
> I am attaching the output of this to the email (10 iperf connections, 4
> worker threads)
>
>
> It’s clearly saturated. Could also do a “clear error”/“show error” and
> “clear tcp stats”/“show tcp stats”?
>
> Because this is purely a server/receiver scenario for vpp, and because
> tcp4-established seems to need a lot of clocks, make sure that iperf runs
> on the same numa vpp’s workers and the nic run on. To see the nic’s numa,
> “show hardware”.
>
> For instance, in my testbed at ~37.5Gbps and 1 connection,
> tcp4-established needs around 7e2 clocks. In your case it goes as high as
> 1.2e4, so it doesn’t look it’s only frequency related.
>

I now repeated this test with all cores and the nic on numa 0. Cores 1-4
are used by VPP and 5-11 by iperf. I get about 63G. I am attaching the vpp
statistics for this case (7connection.txt). It looks like nothing is
hashing to core 4 in this case.
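For reference, this is roughly how I generate the pinned server processes (a sketch of my setup, not the exact script; the echo makes it a dry run - drop it to actually launch the servers):

```shell
#!/bin/sh
# Sketch: one VCL-preloaded iperf3 server per core (5-11), ports 9000-9006,
# mirroring the taskset commands quoted earlier in the thread.
# "echo" makes this a dry run; remove it to actually start the servers.
launch_servers() {
  i=0
  while [ "$i" -le 6 ]; do
    core=$((5 + i))
    port=$((9000 + i))
    echo taskset -c "$core" sh -c \
      "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p $port"
    i=$((i + 1))
  done
}
launch_servers
```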

> Pasting below the output of vpp and vcl conf files:
>> >
>> > cpu {
>> >   main-core 0
>> >   workers 10
>>
>> You can pin vpp’s workers to cores with: corelist-workers c1,c3-cN to
>> avoid overlap with iperf. You might want to start with 1 worker and work
>> your way up from there. In my testing, 1 worker should be enough to
>> saturate a 40Gbps nic with 1 iperf connection. Maybe you need a couple more
>> to reach 100, but I wouldn’t expect more.
>>
>
> I changed this to 4 cores and pinned them as you suggested.
>
>
> See above wrt how vpp’s workers, iperf and the nic should all be on the
> same numa. Make sure iperf and vpp’s workers don’t overlap.
>

Done.
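Concretely, my cpu stanza now looks roughly like this (a sketch; the exact core list follows my numa 0 layout):

```
cpu {
  main-core 0
  corelist-workers 1-4
}
```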


>
>
>
>>
>> > }
>> >
>> > buffers {
>> >   buffers-per-numa 65536
>>
>> Unless you need the buffers for something else, 16k might be enough.
>>
>> >   default data-size 9216
>>
>> Hm, no idea about the impact of this on performance. Session layer can
>> build chained buffers so you can also try with this removed to see if it
>> changes anything.
>>
>
> For now, I kept this setting.
>
>
> If possible, try with 1460 mtu and 2kB buffers, to see if that changes
> anything.
>

Sure, I will try this. I am hitting some issues with the link not coming up
when I reduce the buffer data-size - it could be a driver issue.
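For that run I plan to use something like the following (a sketch of the variant you suggested; I haven't confirmed it comes up cleanly yet given the link issue above):

```
buffers {
  buffers-per-numa 16384
  default data-size 2048
}
tcp {
  mtu 1460
}
```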


>
>> > }
>> >
>> > dpdk {
>> >   dev 0000:15:00.0 {
>> >         name eth0
>> >         num-rx-queues 10
>>
>> Keep this in sync with the number of workers
>>
>> >   }
>> >   enable-tcp-udp-checksum
>>
>> This enables sw checksum. For better performance, you’ll have to remove
>> it. It will be needed however if you want to turn tso on.
>>
>
> ok. removed.
>
>
>>
>> > }
>> >
>> > session {
>> >   evt_qs_memfd_seg
>> > }
>> > socksvr { socket-name /tmp/vpp-api.sock}
>> >
>> > tcp {
>> >   mtu 9216
>> >   max-rx-fifo 262144
>>
>> This is only used to compute the window scale factor. Given that your
>> fifos might be larger, I would remove it. By default the value is 32MB and
>> gives a wnd_scale of 10 (should be okay).
>>
>
> When I was testing with Linux TCP stack on both sides, I was restricting
> the receive window per socket using net.ipv4.tcp_rmem to get better latency
> numbers. I want to mimic that with VPP. What is the right way to restrict
> the rcv_wnd on VPP?
>
>
> The rcv_wnd is controlled by the rx fifo size. This value will limit the
> wnd_scale and the actual fifo size, if larger than 256kB, won’t be
> correctly advertised. So it would be better to remove this and only control
> it from rx fifo.
>
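To make sure I understand the scale computation, here is my own quick approximation (not VPP's actual code) of how the advertised-window scale would follow from the fifo size: pick the smallest scale such that the 16-bit window field, shifted left by the scale, covers the whole fifo.

```shell
#!/bin/sh
# Rough sketch (my approximation, not VPP source): smallest window-scale
# factor (RFC 7323 caps it at 14) whose 16-bit shifted window covers the fifo.
min_wnd_scale() {
  fifo=$1
  scale=0
  while [ "$scale" -lt 14 ] && [ $((fifo >> scale)) -gt 65535 ]; do
    scale=$((scale + 1))
  done
  echo "$scale"
}
min_wnd_scale $((32 << 20))   # default 32MB max-rx-fifo -> scale 10, as you said
min_wnd_scale 262144          # the 256kB max-rx-fifo I had configured
```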

Sure. So I assume rx-fifo-size in vcl.conf is a per-socket fifo size?
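For context, my vcl.conf currently looks roughly like this (a sketch; the fifo sizes are from the runs above):

```
vcl {
  rx-fifo-size 4000000
  tx-fifo-size 4000000
}
```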

Thanks,

Vijay
root@server:~# vppctl show run
Thread 0 vpp_main (lcore 0)
Time 3.3, 10 sec internal node vector rate 0.00 loops/sec 1123145.49
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
cnat-scanner-process            any wait                 0               0               4          2.03e3            0.00
dpdk-process                    any wait                 0               0               1          1.21e4            0.00
fib-walk                        any wait                 0               0               2          2.98e3            0.00
ikev2-manager-process           any wait                 0               0               4          2.10e3            0.00
ip6-mld-process                 any wait                 0               0               4          1.07e3            0.00
ip6-ra-process                  any wait                 0               0               4          8.94e2            0.00
session-queue-main               polling            576464               0               0          1.06e2            0.00
session-queue-process           any wait                 0               0               3          1.37e3            0.00
unix-cli-local:3                 active                  1               0               2          3.29e8            0.00
unix-cli-new-session            any wait                 0               0               3          1.79e3            0.00
unix-epoll-input                 polling            576464               0               0          1.21e4            0.00
wg-timer-manager                any wait                 0               0             326          2.62e2            0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 3.3, 10 sec internal node vector rate 1.00 loops/sec 6238348.06
  vector rates in 6.1281e-1, out 0.0000e0, drop 6.1281e-1, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling          21020380               2               0          1.09e9            0.00
drop                             active                  2               2               0          6.07e2            1.00
error-drop                       active                  2               2               0          6.19e2            1.00
ethernet-input                   active                  2               2               0          6.29e2            1.00
llc-input                        active                  2               2               0          2.70e2            1.00
session-queue                    polling          21020380               0               0          1.51e2            0.00
unix-epoll-input                 polling             20508               0               0          4.48e2            0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 3.3, 10 sec internal node vector rate 0.00 loops/sec 6569148.91
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling          22052253               0               0          1.01e2            0.00
session-queue                    polling          22052253               0               0          1.49e2            0.00
unix-epoll-input                 polling             21514               0               0          4.39e2            0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 3.3, 10 sec internal node vector rate 96.69 loops/sec 2954.17
  vector rates in 4.6875e5, out 2.9038e3, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling              9477         1520339               0          1.71e2          160.42
dsc1-output                      active               9477            9477               0          6.90e2            1.00
dsc1-tx                          active               9477            9477               0          1.05e3            1.00
ethernet-input                   active               9477         1520339               0          1.59e1          160.42
ip4-input-no-checksum            active               9477         1520339               0          2.02e1          160.42
ip4-local                        active               9477         1520339               0          1.82e3          160.42
ip4-lookup                       active               9481         1529816               0          2.52e1          161.36
ip4-rewrite                      active               9477            9477               0          4.15e2            1.00
session-queue                    polling              9477            9477               0          1.55e3            1.00
tcp4-established                 active               9477         1520339               0          2.56e3          160.42
tcp4-input                       active               9477         1520339               0          7.27e1          160.42
tcp4-output                      active               9477            9477               0          8.20e2            1.00
unix-epoll-input                 polling                 9               0               0          1.45e3            0.00
---------------
Thread 4 vpp_wk_3 (lcore 4)
Time 3.3, 10 sec internal node vector rate 0.00 loops/sec 6490532.87
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling          21727288               0               0          1.01e2            0.00
session-queue                    polling          21727288               0               0          1.49e2            0.00
unix-epoll-input                 polling             21198               0               0          4.37e2            0.00


root@server:~# vppctl show error
   Count                    Node                  Reason
         1               snap-input               unknown oui/snap protocol
         2                llc-input               unknown llc ssap/dsap
      8822              session-queue             Packets transmitted
   1415395            tcp4-established            Packets pushed into rx fifo
      8822               tcp4-output              Packets sent
root@server:~# vppctl clear tcp stats
root@server:~# vppctl show tcp stats
Thread 0:
Thread 1:
Thread 2:
Thread 3:
Thread 4:
root@server:~# vppctl show run
Thread 0 vpp_main (lcore 0)
Time 5.2, 10 sec internal node vector rate 0.00 loops/sec 1253686.91
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
cnat-scanner-process            any wait                 0               0               5          3.20e3            0.00
dpdk-process                    any wait                 0               0               2          2.12e4            0.00
fib-walk                        any wait                 0               0               3          5.00e3            0.00
ikev2-manager-process           any wait                 0               0               5          3.19e3            0.00
ip4-full-reassembly-expire-wal  any wait                 0               0               1          4.05e3            0.00
ip4-sv-reassembly-expire-walk   any wait                 0               0               1          3.80e3            0.00
ip6-full-reassembly-expire-wal  any wait                 0               0               1          3.38e3            0.00
ip6-mld-process                 any wait                 0               0               5          1.72e3            0.00
ip6-ra-process                  any wait                 0               0               5          1.47e3            0.00
ip6-sv-reassembly-expire-walk   any wait                 0               0               1          5.27e3            0.00
session-queue-main               polling            922847               0               0          1.07e2            0.00
session-queue-process           any wait                 0               0               5          2.19e3            0.00
statseg-collector-process       time wait                0               0               1          3.80e4            0.00
unix-cli-local:11                active                  1               0               2          2.57e9            0.00
unix-cli-new-session            any wait                 0               0               3          2.38e3            0.00
unix-epoll-input                 polling            922847               0               0          1.21e4            0.00
wg-timer-manager                any wait                 0               0             522          3.50e2            0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 5.2, 10 sec internal node vector rate 140.52 loops/sec 1243.57
  vector rates in 3.2308e5, out 2.5045e3, drop 7.6472e-1, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling              6550         1676800               0          3.49e2          256.00
drop                             active                  4               4               0          2.32e3            1.00
dsc1-output                      active               6550           13100               0          6.32e2            2.00
dsc1-tx                          active               6550           13100               0          9.30e2            2.00
error-drop                       active                  4               4               0          2.41e3            1.00
ethernet-input                   active               6550         1676800               0          2.02e1          256.00
ip4-input-no-checksum            active               6550         1676796               0          2.06e1          255.99
ip4-local                        active               6550         1676796               0          3.04e3          255.99
ip4-lookup                       active              13100         1689896               0          2.62e1          128.99
ip4-rewrite                      active               6550           13100               0          2.74e2            2.00
llc-input                        active                  4               4               0          1.99e3            1.00
session-queue                    polling              6550           13100               0          1.58e3            2.00
snap-input                       active                  1               1               0          6.12e3            1.00
tcp4-established                 active               6550         1676796               0          3.26e3          255.99
tcp4-input                       active               6550         1676796               0          8.89e1          255.99
tcp4-output                      active               6550           13100               0          1.19e3            2.00
unix-epoll-input                 polling                 7               0               0          2.55e3            0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 5.2, 10 sec internal node vector rate 140.96 loops/sec 1246.79
  vector rates in 3.2448e5, out 3.6639e3, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling              6555         1678080               0          3.10e2          256.00
dsc1-output                      active               6555           19165               0          5.81e2            2.92
dsc1-tx                          active               6555           19165               0          7.15e2            2.92
ethernet-input                   active               6555         1678080               0          1.86e1          256.00
ip4-input-no-checksum            active               6555         1678080               0          1.97e1          256.00
ip4-local                        active               6555         1678080               0          3.16e3          256.00
ip4-lookup                       active              13110         1697245               0          2.60e1          129.46
ip4-rewrite                      active               6555           19165               0          1.93e2            2.92
session-queue                    polling              6555           19165               0          1.29e3            2.92
tcp4-established                 active               6555         1678080               0          3.17e3          256.00
tcp4-input                       active               6555         1678080               0          9.17e1          256.00
tcp4-output                      active               6555           19165               0          7.29e2            2.92
unix-epoll-input                 polling                 6               0               0          3.11e3            0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 5.2, 10 sec internal node vector rate 140.55 loops/sec 1218.51
  vector rates in 3.2258e5, out 2.5006e3, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling              6540         1674240               0          3.49e2          256.00
dsc1-output                      active               6540           13080               0          6.63e2            2.00
dsc1-tx                          active               6540           13080               0          1.03e3            2.00
ethernet-input                   active               6540         1674240               0          1.97e1          256.00
ip4-input-no-checksum            active               6540         1674240               0          2.04e1          256.00
ip4-local                        active               6540         1674240               0          3.12e3          256.00
ip4-lookup                       active              13080         1687320               0          2.65e1          129.00
ip4-rewrite                      active               6540           13080               0          2.66e2            2.00
session-queue                    polling              6540           13080               0          1.61e3            2.00
tcp4-established                 active               6540         1674240               0          3.19e3          256.00
tcp4-input                       active               6540         1674240               0          9.08e1          256.00
tcp4-output                      active               6540           13080               0          9.91e2            2.00
unix-epoll-input                 polling                 6               0               0          2.21e3            0.00
---------------
Thread 4 vpp_wk_3 (lcore 4)
Time 5.2, 10 sec internal node vector rate 0.00 loops/sec 6384609.01
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
dpdk-input                       polling          34700727               0               0          1.01e2            0.00
session-queue                    polling          34700727               0               0          1.50e2            0.00
unix-epoll-input                 polling             33854               0               0          5.62e2            0.00
root@server:~# vppctl clear error
root@server:~# vppctl show error
   Count                    Node                  Reason
     16046              session-queue             Packets transmitted
   2053885            tcp4-established            Packets pushed into rx fifo
     16046               tcp4-output              Packets sent
         3                llc-input               unknown llc ssap/dsap
     23632              session-queue             Packets transmitted
   2049856            tcp4-established            Packets pushed into rx fifo
      6944            tcp4-established            OOO packets pushed into rx fifo
     23632               tcp4-output              Packets sent
     15912              session-queue             Packets transmitted
   2036736            tcp4-established            Packets pushed into rx fifo
     15912               tcp4-output              Packets sent
root@server:~# vppctl clear tcp stats
root@server:~# vppctl show tcp stats
Thread 0:
Thread 1:
Thread 2:
Thread 3:
Thread 4: