Hi,

bluhm and I did some network performance measurements and kernel
profiling.

Setup:  Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)

We found that the kernel spends a huge amount of processing time
sending ACKs to the sender on the receiving interface.  After
receiving a data segment, we send out two ACKs.  The first one is sent
in tcp_input() directly after receiving.  The second ACK is sent out
after userland or the sosplice task has read some data out of the
socket buffer.

The first ACK in tcp_input() is sent after receiving every other data
segment, as described in RFC 1122:

        4.2.3.2  When to Send an ACK Segment
                A TCP SHOULD implement a delayed ACK, but an ACK should
                not be excessively delayed; in particular, the delay
                MUST be less than 0.5 seconds, and in a stream of
                full-sized segments there SHOULD be an ACK for at least
                every second segment.

This advice is based on the paper "Congestion Avoidance and Control":

        4 THE GATEWAY SIDE OF CONGESTION CONTROL
                The 8 KBps senders were talking to 4.3+BSD receivers
                which would delay an ack for at most one packet (because
                of an ack's 'clock' role, the authors believe that the
                minimum ack frequency should be every other packet).

Sending the first ACK (on every other packet) costs us too much
processing time.  Thus, we run into a full socket buffer earlier.  The
first ACK just acknowledges the received data, but does not update the
window.  The second ACK, caused by the socket buffer reader, both
acknowledges the data and updates the window.  So the second ACK is
worth much more for fast packet processing than the first one.

The performance improvement is roughly 33% with splicing and 20%
without splicing:

                        splicing        relaying

        current         3.1 GBit/s      2.6 GBit/s
        w/o first ack   4.1 GBit/s      3.1 GBit/s
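
For reference, the relative gain can be recomputed from the table.
This is only a sketch; gain_percent() is a name made up for
illustration.  With the rounded table values it gives about 32% for
splicing and about 19% for relaying, close to the figures above.

```c
#include <assert.h>

/*
 * Relative throughput gain in percent.
 * gain_percent() is a hypothetical helper, not part of any tool we used.
 */
static double
gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Call it as gain_percent(3.1, 4.1) for splicing and
gain_percent(2.6, 3.1) for relaying.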

As far as I understand the implementation of other operating systems:
Linux has implemented a custom TCP_QUICKACK socket option to turn this
kind of feature on and off.  FreeBSD and NetBSD still depend on it when
using the New Reno implementation.
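
For illustration, this is roughly how a Linux program toggles that
option.  It is a minimal sketch: set_quickack() is a hypothetical
wrapper name, and the fallback branch for systems without
TCP_QUICKACK is my own assumption about how one might handle them.
According to Linux tcp(7), the flag is not permanent; the stack may
enter or leave quickack mode again on its own later.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Enable (on != 0) or disable (on == 0) quickack mode on a TCP socket.
 * Returns 0 on success, -1 with errno set on failure.
 * TCP_QUICKACK is Linux specific; where it is not defined, this
 * hypothetical wrapper fails with ENOPROTOOPT.
 */
static int
set_quickack(int fd, int on)
{
	assert(fd >= 0);
#ifdef TCP_QUICKACK
	return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
#else
	(void)on;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```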

The following diff turns off the direct ACK on every other segment.  We
have been running this diff in production on our own machines at genua
and on our products for several months now.  We haven't noticed any
problems, neither with interactive network sessions (ssh) nor with bulk
traffic.
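
To make the effect concrete, here is a toy userland model of the
input-path decision with and without the direct ACK.  It is not kernel
code.  All names are hypothetical, and it assumes a stream of data
segments with no PUSH flag, no loopback interface, no delayed ACK
timer expiry and no socket buffer reader in between.  It also assumes
that sending an ACK disarms the delayed ACK timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified receiver state; not struct tcpcb. */
struct rx_model {
	bool delack_armed;	/* models TCP_TIMER_ISARMED(tp, TCPT_DELACK) */
	int acks_sent;		/* ACKs sent directly from the input path */
};

/* Current behavior: ACK at once if an ACK is already being delayed. */
static void
rx_segment_old(struct rx_model *rx)
{
	if (rx->delack_armed) {
		rx->delack_armed = false;	/* assumed: ACK disarms timer */
		rx->acks_sent++;
	} else
		rx->delack_armed = true;
}

/* With the diff: always delay, the reader or the timer ACKs later. */
static void
rx_segment_new(struct rx_model *rx)
{
	rx->delack_armed = true;
}

/* Count direct input-path ACKs for nsegs back-to-back segments. */
static int
count_input_acks(int nsegs, bool old_policy)
{
	struct rx_model rx = { false, 0 };
	int i;

	assert(nsegs >= 0);
	for (i = 0; i < nsegs; i++) {
		if (old_policy)
			rx_segment_old(&rx);
		else
			rx_segment_new(&rx);
	}
	return rx.acks_sent;
}
```

In this model count_input_acks(10, true) counts one direct ACK per
two segments, while count_input_acks(10, false) counts none; the
remaining ACKs come from the reader or the timer, which the model
leaves out.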

Another solution could be a sysctl(3) or an additional socket option,
similar to Linux, to control this behavior per socket or system-wide.
Also, a counter to ACK every 3rd, 4th... data segment could solve the
problem.
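
The counter idea could look roughly like the following.  This is only
a sketch of the concept: struct ack_counter and ack_counter_segment()
are hypothetical names and do not exist in the tree, and nothing like
this is part of the diff below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection state; not a field of struct tcpcb. */
struct ack_counter {
	unsigned int unacked;	/* data segments seen since the last ACK */
	unsigned int every;	/* ACK every N-th segment, N >= 1 */
};

/* Returns true if the segment just received should be ACKed at once. */
static bool
ack_counter_segment(struct ack_counter *ac)
{
	assert(ac->every >= 1);
	if (++ac->unacked >= ac->every) {
		ac->unacked = 0;
		return true;
	}
	return false;
}
```

With every set to 2 this matches the current every-other-segment
behavior.  A real implementation would also have to reset the counter
whenever the reader or the delayed ACK timer sends an ACK.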

bye,
Jan

Index: netinet/tcp_input.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.365
diff -u -p -r1.365 tcp_input.c
--- netinet/tcp_input.c 19 Jun 2020 22:47:22 -0000      1.365
+++ netinet/tcp_input.c 5 Nov 2020 23:00:34 -0000
@@ -165,8 +165,8 @@ do { \
 #endif
 
 /*
- * Macro to compute ACK transmission behavior.  Delay the ACK unless
- * we have already delayed an ACK (must send an ACK every two segments).
+ * Macro to compute ACK transmission behavior.  Delay the ACK until
+ * a read from the socket buffer or the delayed ACK timer causes one.
  * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
  * option is enabled or when the packet is coming from a loopback
  * interface.
@@ -176,8 +176,7 @@ do { \
        struct ifnet *ifp = NULL; \
        if (m && (m->m_flags & M_PKTHDR)) \
                ifp = if_get(m->m_pkthdr.ph_ifidx); \
-       if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
-           (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
+       if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
            (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
                tp->t_flags |= TF_ACKNOW; \
        else \
