Hi Vijay,

Inline.
> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> Thanks for the response. Please see inline:
>
> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Vijay,
>
> Cool experiment. More inline.
>
> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>
> May I ask the frequency for those cores? Also what type of nic are you using?
>
> 2700 MHz.

Probably this somewhat limits throughput per single connection compared to my testbed, where the Intel cpu boosts to 4GHz.

> The nic is a Pensando DSC100.

Okay, not sure what to expect there. Since this mostly stresses the rx side, what’s the number of rx descriptors? Typically I test with 256; with more connections/higher throughput you might need more.

> > I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side. On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and running iperf3 using the VCL library as follows:
> >
> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
> > ...
> >
> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
> >
> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
> > ...
>
> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
>
> These are the numbers I get:
> rx-fifo-size 65536: ~8G
> rx-fifo-size 524288: 22G
> rx-fifo-size 4000000: 25G

Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?

> rx-fifo-size 8000000: 25G
>
> > I see that I am not able to push beyond 50-60G. I tried different sizes for the vcl rx-fifo-size - 64K, 256K and 1M. With 1M fifo size, I see that tcp latency as reported on the client increases, but not a significant improvement in bandwidth. Are there any suggestions to achieve 100G bandwidth? I am using a vpp build from master.
>
> Depends a lot on how many connections you’re running in parallel. With only one connection, buffer occupancy might go up, so 1-2MB might be better.
>
> With the current run I increased this to 8000000.
>
> Could you also check how busy vpp is with “clear run”, wait at least 1 second and then “show run”. That will give you per node/worker vector rates. If they go above 100 vectors/dispatch the workers are busy, so you could increase their number and implicitly the number of sessions. Note however that RSS is not perfect, so you can get more connections on one worker.
>
> I am attaching the output of this to the email (10 iperf connections, 4 worker threads)

It’s clearly saturated. Could you also do a “clear error”/“show error” and “clear tcp stats”/“show tcp stats”?
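In case it’s handy, the rough sequence would be something like the sketch below (just an illustration, assuming vppctl is talking to the default CLI socket; adjust if you invoke the CLI differently):

  vppctl clear run
  vppctl clear error
  vppctl clear tcp stats
  # let traffic run for at least a second so the counters cover a clean window
  sleep 1
  vppctl show run
  vppctl show error
  vppctl show tcp stats

That should give the per-node/per-worker vector rates, the error counters and the tcp stats over the same interval.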
Because this is purely a server/receiver scenario for vpp, and because tcp4-established seems to need a lot of clocks, make sure that iperf runs on the same numa node that vpp’s workers and the nic are on. To see the nic’s numa, “show hardware”. For instance, in my testbed at ~37.5Gbps and 1 connection, tcp4-established needs around 7e2 clocks. In your case it goes as high as 1.2e4, so it doesn’t look like it’s only frequency related.

> > Pasting below the output of vpp and vcl conf files:
> >
> > cpu {
> > main-core 0
> > workers 10
>
> You can pin vpp’s workers to cores with: corelist-workers c1,c3-cN to avoid overlap with iperf. You might want to start with 1 worker and work your way up from there. In my testing, 1 worker should be enough to saturate a 40Gbps nic with 1 iperf connection. Maybe you need a couple more to reach 100, but I wouldn’t expect more.
>
> I changed this to 4 cores and pinned them as you suggested.

See above wrt how vpp’s workers, iperf and the nic should all be on the same numa node. Make sure iperf and vpp’s workers don’t overlap.

> > }
> >
> > buffers {
> > buffers-per-numa 65536
>
> Unless you need the buffers for something else, 16k might be enough.
>
> > default data-size 9216
>
> Hm, no idea about the impact of this on performance. Session layer can build chained buffers, so you can also try with this removed to see if it changes anything.
>
> For now, I kept this setting.

If possible, try with 1460 mtu and 2kB buffers, to see if that changes anything.

> > }
> >
> > dpdk {
> > dev 0000:15:00.0 {
> > name eth0
> > num-rx-queues 10
>
> Keep this in sync with the number of workers.
>
> > }
> > enable-tcp-udp-checksum
>
> This enables sw checksum. For better performance, you’ll have to remove it. It will be needed however if you want to turn tso on.
>
> ok. removed.
>
> > }
> >
> > session {
> > evt_qs_memfd_seg
> > }
> > socksvr { socket-name /tmp/vpp-api.sock}
> >
> > tcp {
> > mtu 9216
> > max-rx-fifo 262144
>
> This is only used to compute the window scale factor. Given that your fifos might be larger, I would remove it. By default the value is 32MB and gives a wnd_scale of 10 (should be okay).
>
> When I was testing with Linux TCP stack on both sides, I was restricting the receive window per socket using net.ipv4.tcp_rmem to get better latency numbers. I want to mimic that with VPP. What is the right way to restrict the rcv_wnd on VPP?

The rcv_wnd is controlled by the rx fifo size. This value will limit the wnd_scale, and the actual fifo size, if larger than 256kB, won’t be correctly advertised. So it would be better to remove this and only control it from the rx fifo.

> > }
> >
> > vcl.conf:
> > vcl {
> > max-workers 1
>
> No need to constrain it.
>
> > rx-fifo-size 262144
> > tx-fifo-size 262144
>
> As previously mentioned you can configure them to be larger.
>
> Made them 8000000.
>
> Attaching the show run output with 4 workers to this email. Still getting about 50G.

Ack. Let’s see if the next set of improvements yield something. I’ve put a consolidated sketch of the config changes discussed above in the PS below.

Regards,
Florin

> Thanks,
>
> Vijay
> <show_run.txt>
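PS: Pulling the suggestions in this thread together, the startup.conf deltas would look roughly like the sketch below. It is only a sketch: the corelist-workers range and the num-rx-desc value are placeholders for illustration — pick cores on the nic’s numa node that don’t overlap with the iperf servers, and size the rx ring to what the DSC100 actually supports.

  cpu {
    main-core 0
    # placeholder: choose worker cores on the nic's numa node, not shared with iperf
    corelist-workers 1-4
  }

  buffers {
    # 16k buffers should be enough unless something else needs them
    buffers-per-numa 16384
    default data-size 9216
  }

  dpdk {
    dev 0000:15:00.0 {
      name eth0
      # keep in sync with the number of workers
      num-rx-queues 4
      # placeholder: more descriptors than the 256 I usually test with
      num-rx-desc 1024
    }
    # enable-tcp-udp-checksum removed (sw checksum; only needed if you turn tso on)
  }

  session { evt_qs_memfd_seg }
  socksvr { socket-name /tmp/vpp-api.sock }

  tcp {
    mtu 9216
    # max-rx-fifo removed; let the rx fifo size drive rcv_wnd/wnd_scale
  }

and in vcl.conf, with ~4MB fifos (the sweet spot from your single-connection runs) and no max-workers constraint:

  vcl {
    rx-fifo-size 4000000
    tx-fifo-size 4000000
  }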