On Fri, Nov 06, 2020 at 08:03:36AM +0100, Otto Moerbeek wrote:
> On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > bluhm and I made some network performance measurements and did some
> > kernel profiling.
> >
> > Setup: Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> >
> > We found that the kernel spends a huge amount of processing time
> > sending ACKs back to the sender on the receiving interface. After
> > receiving a data segment, we send two ACKs. The first one is sent in
> > tcp_input() directly after receiving. The second ACK is sent out after
> > userland or the sosplice task has read some data out of the socket
> > buffer.
> >
> > The first ACK in tcp_input() is sent after receiving every other data
> > segment, as described in RFC 1122:
> >
> > 4.2.3.2 When to Send an ACK Segment
> > A TCP SHOULD implement a delayed ACK, but an ACK should
> > not be excessively delayed; in particular, the delay
> > MUST be less than 0.5 seconds, and in a stream of
> > full-sized segments there SHOULD be an ACK for at least
> > every second segment.
> >
> > This advice is based on the paper "Congestion Avoidance and Control":
> >
> > 4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > The 8 KBps senders were talking to 4.3+BSD receivers
> > which would delay an ack for at most one packet (because
> > of an ack's 'clock' role, the authors believe that the
> > minimum ack frequency should be every other packet).
> >
> > Sending the first ACK (on every other packet) costs us too much
> > processing time. Thus, we run into a full socket buffer earlier. The
> > first ACK just acknowledges the received data, but does not update the
> > window. The second ACK, caused by the socket buffer reader, both
> > acknowledges the data and updates the window. So the second ACK is
> > worth much more for fast packet processing than the first one.
> >
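To get a feeling for how many ACKs are at stake, here is a toy model in
userland C. It is not kernel code. It assumes that the reader drains the
socket buffer after every read_every segments and that each drain
produces exactly one ACK; the real stack only sends a window update when
the window has moved far enough. The name count_acks() is made up for
this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: count the ACKs sent for nsegs received data segments.
 *
 * read_every      the reader drains the buffer after this many segments
 *                 (assumption, each drain sends one ACK)
 * ack_every_other true models the current behavior (ACK every second
 *                 segment from the input path), false models the diff
 */
static unsigned
count_acks(unsigned nsegs, unsigned read_every, bool ack_every_other)
{
	unsigned acks = 0, pending = 0, i;

	assert(read_every > 0);

	for (i = 1; i <= nsegs; i++) {
		pending++;
		/* first ACK: sent from the input path on every other segment */
		if (ack_every_other && pending == 2) {
			acks++;
			pending = 0;
		}
		/* second ACK: sent when the reader drains the socket buffer */
		if (i % read_every == 0) {
			acks++;
			pending = 0;
		}
	}
	return acks;
}
```

With 1000 segments and a reader that drains after every 4 segments, the
model sends 750 ACKs with the current behavior and 250 without the first
ACK. The ratio depends entirely on the assumed read interval.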
> > The performance improvement is roughly 33% with splicing and roughly
> > 20% without splicing:
> >
> >                     splicing      relaying
> >
> >     current         3.1 GBit/s    2.6 GBit/s
> >     w/o first ack   4.1 GBit/s    3.1 GBit/s
> >
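The percentages follow directly from the table. A small helper to redo
the arithmetic (gain_percent() is only an illustrative name):

```c
#include <assert.h>

/* Relative throughput gain in percent, going from "before" to "after". */
static double
gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

gain_percent(3.1, 4.1) gives about 32.3 for splicing and
gain_percent(2.6, 3.1) gives about 19.2 for plain relaying.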
> > As far as I understand the implementations of other operating systems:
> > Linux has implemented a custom TCP_QUICKACK socket option to turn this
> > kind of behavior on and off. FreeBSD and NetBSD still depend on this
> > ACK behavior when using the New Reno implementation.
> >
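For reference, this is roughly how the Linux option is used from
userland. A minimal sketch: set_quickack() is a made-up wrapper, and on
systems without TCP_QUICKACK it simply fails with ENOPROTOOPT. Note that
according to tcp(7) the flag is not permanent; the stack may enter or
leave quickack mode again later, so applications that care usually set it
again after reads.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>

/*
 * Switch Linux quickack mode on (1) or off (0) for a TCP socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
set_quickack(int fd, int on)
{
#ifdef TCP_QUICKACK
	return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
#else
	/* option is Linux specific, report it as unsupported elsewhere */
	(void)fd;
	(void)on;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```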
> > The following diff turns off the direct ACK on every other segment. We
> > have been running this diff in production on our own machines at genua
> > and on our products for several months now. We haven't noticed any
> > problems, neither with interactive network sessions (ssh) nor with
> > bulk traffic.
> >
> > Another solution could be a sysctl(3) or an additional socket option,
> > similar to Linux, to control this behavior per socket or system wide.
> > Also, a counter to ACK only every 3rd, 4th... data segment could
> > mitigate the problem.
>
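The counter idea could look something like the following sketch. This is
not part of the diff; struct ack_counter and both functions are
hypothetical names, and a real implementation would keep the state in
struct tcpcb and still rely on the delayed ACK timer as a fallback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection state for "ACK every Nth data segment". */
struct ack_counter {
	unsigned every;		/* ACK after this many unacked data segments */
	unsigned pending;	/* data segments received since the last ACK */
};

/*
 * Account for one received data segment.
 * Returns true if an ACK should be sent right away.
 */
static bool
ack_counter_segment(struct ack_counter *ac)
{
	assert(ac->every > 0);
	if (++ac->pending >= ac->every) {
		ac->pending = 0;
		return true;
	}
	return false;
}

/* Reset when an ACK went out for another reason (reader, timer). */
static void
ack_counter_acked(struct ack_counter *ac)
{
	ac->pending = 0;
}
```

With every set to 2 this reproduces the RFC 1122 behavior; larger values
trade ACK frequency for less processing time.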
> I am wondering if you also looked at another scenario: the process
> reading the socket is sleeping, so the receive buffer fills up without
> any ACKs being sent. Won't that lead to a lot of retransmissions
> containing data?

No, an ACK will always be sent out once the delayed ACK timer fires.
So it shouldn't be a problem when nobody on the system reads from the
socket buffer.

> > Index: netinet/tcp_input.c
> > ===================================================================
> > RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> > retrieving revision 1.365
> > diff -u -p -r1.365 tcp_input.c
> > --- netinet/tcp_input.c 19 Jun 2020 22:47:22 -0000 1.365
> > +++ netinet/tcp_input.c 5 Nov 2020 23:00:34 -0000
> > @@ -165,8 +165,8 @@ do { \
> > #endif
> >
> > /*
> > - * Macro to compute ACK transmission behavior. Delay the ACK unless
> > - * we have already delayed an ACK (must send an ACK every two segments).
> > + * Macro to compute ACK transmission behavior. Delay the ACK until
> > + * a read from the socket buffer or the delayed ACK timer causes one.
> > * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
> > * option is enabled or when the packet is coming from a loopback
> > * interface.
> > @@ -176,8 +176,7 @@ do { \
> > struct ifnet *ifp = NULL; \
> > if (m && (m->m_flags & M_PKTHDR)) \
> > ifp = if_get(m->m_pkthdr.ph_ifidx); \
> > - if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
> > - (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > + if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
> > tp->t_flags |= TF_ACKNOW; \
> > else \
> >
>
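Stripped of the macro plumbing, the change in the decision boils down to
the following. This is a standalone sketch for illustration only: struct
ack_input and the two functions do not exist in the kernel, the fields
merely stand in for the conditions tested in the macro.

```c
#include <stdbool.h>

/* Stand-ins for the conditions the macro looks at. */
struct ack_input {
	bool delack_armed;	/* TCP_TIMER_ISARMED(tp, TCPT_DELACK) */
	bool ack_on_push;	/* tcp_ack_on_push */
	bool push;		/* tiflags & TH_PUSH */
	bool loopback;		/* ifp->if_flags & IFF_LOOPBACK */
};

/* Before the diff: true means TF_ACKNOW, false means delay the ACK. */
static bool
ack_now_old(const struct ack_input *in)
{
	return in->delack_armed ||
	    (in->ack_on_push && in->push) ||
	    in->loopback;
}

/* After the diff: an already armed delayed ACK timer no longer forces an ACK. */
static bool
ack_now_new(const struct ack_input *in)
{
	return (in->ack_on_push && in->push) ||
	    in->loopback;
}
```

The only case that changes is a second data segment arriving while the
delayed ACK timer is already armed: the old code ACKs immediately, the
new code keeps waiting for the reader or the timer.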