Re: [dpdk-users] Performance troubleshooting of TCP/IP stack over DPDK.

Pavel Vajarov Thu, 07 May 2020 22:05:08 -0700

Thanks for the response.

F-Stack has a TSO option in its config file, which we turned on for the
tests.
I'll check fd.io.


On Thu, May 7, 2020 at 11:31 PM Stephen Hemminger <
[email protected]> wrote:

> On Thu, 7 May 2020 07:09:44 -0700
> dave seddon <[email protected]> wrote:
>
> > tc qdisc
> > https://linux.die.net/man/8/tc
> >
> > On Thu, May 7, 2020 at 3:47 AM Pavel Vajarov <[email protected]> wrote:
> >
> > > On Wed, May 6, 2020 at 5:55 PM Stephen Hemminger <
> > > [email protected]>
> > > wrote:
> > >
> > > > On Wed, 6 May 2020 08:14:20 +0300
> > > > Pavel Vajarov <[email protected]> wrote:
> > > >
> > > > > Hi there,
> > > > >
> > > > > We are trying to compare the performance of DPDK+FreeBSD
> networking
> > > stack
> > > > > vs standard Linux kernel and we have problems finding out why the
> > > former
> > > > is
> > > > > slower. The details are below.
> > > > >
> > > > > There is a project called F-Stack <
> https://github.com/F-Stack/f-stack
> > > >.
> > > > > It glues the networking stack from
> > > > > FreeBSD 11.01 over DPDK. We made a setup to test the performance of
> > > > > transparent
> > > > > TCP proxy based on F-Stack and another one running on Standard
> Linux
> > > > > kernel.
> > > > > We did the tests on KVM with 2 cores (Intel(R) Xeon(R) Gold 6139
> CPU @
> > > > > 2.30GHz)
> > > > > > and 32GB RAM. A 10Gbps NIC was attached in passthrough mode.
> > > > > The application level code, the one which handles epoll
> notifications
> > > and
> > > > > > memcpys data between the sockets, of both proxy applications is
> 100%
> > > > the
> > > > > same. Both proxy applications are single threaded and in all tests
> we
> > > > > pinned the applications on core 1. The interrupts from the network
> card
> > > > > were pinned to the same core 1 for the test with the standard Linux
> > > > > application.
> > > > >
> > > > > Here are the test results:
> > > > > 1. The Linux based proxy was able to handle about 1.7-1.8 Gbps
> before
> > > it
> > > > > started to throttle the traffic. No visible CPU usage was observed
> on
> > > > core
> > > > > 0 during the tests, only core 1, where the application and the
> IRQs
> > > were
> > > > > pinned, took the load.
> > > > > > 2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before
> it
> > > > > started to throttle the traffic. No visible CPU usage was observed
> on
> > > > core
> > > > > 0 during the tests only core 1, where the application was pinned,
> took
> > > > the
> > > > > > load. In some of the later tests I made some changes to the number
> of
> > > > read
> > > > > packets in one call from the network card and the number of
> handled
> > > > events
> > > > > in one call to epoll. With these changes I was able to increase the
> > > > > throughput
> > > > > to 900-1000 Mbps but couldn't increase it more.
> > > > > 3. We did another test with the DPDK+FreeBSD proxy just to give us
> some
> > > > > more info about the problem. We disabled the TCP proxy
> functionality
> > > and
> > > > > let the packets be simply ip forwarded by the FreeBSD stack. In
> this
> > > test
> > > > > we reached up to 5Gbps without being able to throttle the traffic.
> We
> > > > just
> > > > > don't have more traffic to redirect there at the moment. So the
> > > > bottleneck
> > > > > > seems to be either in the upper level of the network stack or in the
> > > > > application
> > > > > code.
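
For a rough sense of scale, the throughput figures quoted above can be
turned into a per-packet CPU budget on a single 2.30GHz core. This is only
a back-of-the-envelope sketch: the 1500-byte packet size is an assumption
(the real traffic mix is not stated), Ethernet framing overhead is ignored,
and `cycles_per_packet` is a hypothetical helper name.

```python
def cycles_per_packet(throughput_bps, packet_bytes, cpu_hz):
    """Return (packets per second, CPU cycles available per packet)."""
    pps = throughput_bps / (packet_bytes * 8)
    return pps, cpu_hz / pps


if __name__ == "__main__":
    CPU_HZ = 2.3e9       # core clock taken from the test setup above
    PACKET_BYTES = 1500  # ASSUMPTION: full-size packets, framing ignored

    # Throughput figures taken from the three test results above.
    for label, bps in [
        ("DPDK+FreeBSD proxy, 800 Mbps", 800e6),
        ("DPDK+FreeBSD proxy, 1000 Mbps", 1000e6),
        ("Linux proxy, 1.8 Gbps", 1.8e9),
        ("IP forwarding only, 5 Gbps", 5e9),
    ]:
        pps, cycles = cycles_per_packet(bps, PACKET_BYTES, CPU_HZ)
        print(f"{label}: {pps:,.0f} pps, {cycles:,.0f} cycles/packet")
```

A proxy receives each packet on one socket and sends it on another, so if
the quoted throughput counts one direction only, the budget per proxied
packet is roughly half of what this prints.
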
> > > > >
> > > > > > There is a Huawei switch which redirects the traffic to this
> server. It
> > > > > regularly
> > > > > sends arping and if the server doesn't respond it stops the
> > > redirection.
> > > > > So we assumed that when the redirection stops it's because the
> server
> > > > > throttles the traffic and drops packets and can't respond to the
> arping
> > > > > because
> > > > > > of the packet drops.
> > > > >
> > > > > The whole application can be very roughly represented in the
> following
> > > > way:
> > > > >  - Write pending outgoing packets to the network card
> > > > > >  - Read incoming packets from the network card
> > > > >  - Push the incoming packets to the FreeBSD stack
> > > > >  - Call epoll_wait/kevent without waiting
> > > > >  - Handle the events
> > > > >  - loop from the beginning
> > > > > According to the performance profiling that we did, aside from
> packet
> > > > > processing,
> > > > >  about 25-30% of the application time seems to be spent in the
> > > > > epoll_wait/kevent
> > > > > even though the `timeout` parameter of this call is set to 0 i.e.
> > > > > > it shouldn't block waiting for events if there are none.
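
One way to tell whether that 25-30% is real event handling or just the
fixed cost of polling an idle event queue is to count how many zero-timeout
polls return nothing. The sketch below shows the idea with the Linux
kernel's epoll through Python's `select` module; it does not use F-Stack's
own API, and `poll_stats` is a hypothetical helper.

```python
import select
import socket


def poll_stats(ep, iterations, max_events=64):
    """Poll `ep` with a zero timeout and count empty vs. productive calls.

    Returns (empty_polls, productive_polls, total_events).
    """
    empty = productive = events = 0
    for _ in range(iterations):
        ready = ep.poll(0, max_events)  # timeout 0: return immediately
        if ready:
            productive += 1
            events += len(ready)
        else:
            empty += 1
    return empty, productive, events


if __name__ == "__main__":
    a, b = socket.socketpair()
    ep = select.epoll()
    ep.register(a.fileno(), select.EPOLLIN)

    # Nothing has been written yet, so every poll comes back empty.
    print("idle:", poll_stats(ep, 1000))

    # Level-triggered epoll keeps reporting the socket until it is read.
    b.send(b"x")
    print("with pending data:", poll_stats(ep, 1000))

    ep.close()
    a.close()
    b.close()
```

If most polls in the real loop come back empty, polling for events less
often (for example once per several packet bursts) should cut that
overhead, which would be consistent with the gain seen after reading more
packets per call.
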
> > > > >
> > > > > I can give you much more details and code for everything, if
> needed.
> > > > >
> > > > > My questions are:
> > > > > 1. Does somebody have observations or educated guesses about what
> > > amount
> > > > of
> > > > > traffic should I expect the DPDK + FreeBSD stack + kevent to
> process in
> > > > the
> > > > > above
> > > > > scenario? Are the numbers low or expected?
> > > > > > We expected to see better performance than the standard Linux
> kernel
> > > > one
> > > > > but
> > > > > so far we can't get this performance.
> > > > > > 2. Do you think the difference comes from the time spent
> > > > handling
> > > > > > packets
> > > > > > and handling epoll in both of the tests? What I mean is this: for the
> > > standard
> > > > > Linux tests
> > > > > > the interrupt handling has higher priority than the epoll
> handling and
> > > > > thus the application
> > > > > can spend much more time handling packets and processing them in
> the
> > > > kernel
> > > > > than
> > > > > handling epoll events in the user space. For the DPDK+FreeBSD case
> the
> > > > time
> > > > > for
> > > > > handling packets and the time for processing epolls is kind of
> equal. I
> > > > > think, that this was
> > > > > > the reason why we were able to get more performance by increasing
> the
> > > number
> > > > > of read
> > > > > packets at one go and decreasing the epoll events. However, we
> couldn't
> > > > > increase the
> > > > > throughput enough with these tweaks.
> > > > > 3. Can you suggest something else that we can test/measure/profile
> to
> > > get
> > > > > > a better idea
> > > > > what exactly is happening here and to improve the performance more?
> > > > >
> > > > > Any help is appreciated!
> > > > >
> > > > > Thanks in advance,
> > > > > Pavel.
> > > >
> > > > First off, if you are testing on KVM, are you using PCI pass thru
> or
> > > SR-IOV
> > > > > to make the device available to the guest directly? The default mode
> uses
> > > > a Linux bridge, and this results in multiple copies and context
> switches.
> > > > You end up testing Linux bridge and virtio performance, not TCP.
> > > >
> > > > To get full speed with TCP and most software stacks you need TCP
> > > > segmentation
> > > > offload.
> > > >
> > > > Also software queue discipline, kernel version, and TCP congestion
> > > control
> > > > can have a big role in your result.
> > > >
> > >
> > > Hi,
> > >
> > > Thanks for the response.
> > >
> > > We did the tests on Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-96-generic
> > > x86_64).
> > > The NIC was given to the guest using SR-IOV.
> > > The TCP segmentation offload was enabled for both tests (standard
> Linux and
> > > DPDK+FreeBSD).
> > > The congestion control algorithm for both tests was 'cubic'.
> > >
> > > What do you mean by 'software queue discipline'?
>
> The default qdisc in Ubuntu should be fq_codel (see tc qdisc show)
> and that in general has a positive effect on reducing bufferbloat.
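
The system-wide defaults mentioned here can also be read straight from
/proc/sys. A minimal sketch, assuming a Linux host; `read_sysctl` is a
hypothetical helper, and `net.core.default_qdisc` only reports the default
for newly created interfaces, so `tc qdisc show dev <iface>` remains the
authority for a specific NIC.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the value of a dotted sysctl name, or None if it is unreadable."""
    path = Path(root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in ("net.core.default_qdisc", "net.ipv4.tcp_congestion_control"):
        print(f"{key} = {read_sysctl(key)}")
```

Running this on both the Linux proxy host and any Linux peer in the test
path makes it easy to confirm that the two setups are compared under the
same qdisc and congestion control settings.
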
>
> F-stack probably doesn't use TSO, you might want to look at TCP stack
> from FD.io for comparison.
>
>
>
