Re: diff: tcp ack improvement

2021-02-08 Thread Jan Klemkow
On Mon, Feb 08, 2021 at 03:42:54PM +0100, Alexander Bluhm wrote:
> On Wed, Feb 03, 2021 at 11:20:04AM +0100, Claudio Jeker wrote:
> > Just commit it. OK claudio@
> > If people see problems we can back it out again.
> 
> This has a huge impact on TCP performance.
> 
> http://bluhm.genua.de/perform/results/2021-02-07T00%3A01%3A40Z/perform.html
> 
> For a single TCP connection between two OpenBSD boxes, throughput
> drops by 77% from 3.1 GBit/sec to 710 MBit/sec.  But with 100
> parallel connections the overall throughput increases by 5%.

For single connections our kernel is limited to sending out at most
4 TCP segments per tcp_output() call.  I don't see that, because I
just measured with 10 and 30 streams in parallel.

FreeBSD disabled it 20 years ago.
https://github.com/freebsd/freebsd-src/commit/d912c694ee00de5ea0f46743295a0fc603cab562

I would suggest removing the whole feature.
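
For a rough sense of scale, a back-of-the-envelope sketch (the
1448-byte payload is an assumed typical MSS; the 3.1 GBit/sec figure
is the single-stream rate measured above):

/*
 * Illustration of the TCP_MAXBURST cap: one call to tcp_output() can
 * move at most 4 segments, so a single stream needs a very high rate
 * of input events (ACKs) to fill a fast link.  Assumed numbers, not a
 * measurement.
 */
#include <stdio.h>

int
main(void)
{
	const double maxburst = 4;		/* TCP_MAXBURST */
	const double payload = 1448;		/* bytes per segment, assumed MSS */
	const double target = 3.1e9 / 8;	/* 3.1 GBit/sec in bytes/sec */
	const double per_call = maxburst * payload;

	printf("at most %.0f bytes per tcp_output() call\n", per_call);
	printf("about %.0f calls/sec needed for 3.1 GBit/sec\n",
	    target / per_call);
	return 0;
}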

bye,
Jan

Index: tcp.h
===
RCS file: /cvs/src/sys/netinet/tcp.h,v
retrieving revision 1.21
diff -u -p -r1.21 tcp.h
--- tcp.h   10 Jul 2019 18:45:31 -  1.21
+++ tcp.h   8 Feb 2021 17:52:38 -
@@ -105,8 +105,6 @@ struct tcphdr {
 #defineTCP_MAX_SACK3   /* Max # SACKs sent in any segment */
 #defineTCP_SACKHOLE_LIMIT 128  /* Max # SACK holes per connection */
 
-#defineTCP_MAXBURST4   /* Max # packets after leaving Fast Rxmit */
-
 /*
  * Default maximum segment size for TCP.
  * With an IP MSS of 576, this is 536,
Index: tcp_output.c
===
RCS file: /cvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.129
diff -u -p -r1.129 tcp_output.c
--- tcp_output.c25 Jan 2021 03:40:46 -  1.129
+++ tcp_output.c8 Feb 2021 17:53:07 -
@@ -203,7 +203,6 @@ tcp_output(struct tcpcb *tp)
int idle, sendalot = 0;
int i, sack_rxmit = 0;
struct sackhole *p;
-   int maxburst = TCP_MAXBURST;
 #ifdef TCP_SIGNATURE
unsigned int sigoff;
 #endif /* TCP_SIGNATURE */
@@ -1120,7 +1119,7 @@ out:
tp->last_ack_sent = tp->rcv_nxt;
tp->t_flags &= ~TF_ACKNOW;
TCP_TIMER_DISARM(tp, TCPT_DELACK);
-   if (sendalot && --maxburst)
+   if (sendalot)
goto again;
return (0);
 }



Re: diff: tcp ack improvement

2021-02-08 Thread Theo de Raadt
Claudio Jeker  wrote:

> On Mon, Feb 08, 2021 at 07:46:46PM +0100, Alexander Bluhm wrote:
> > On Mon, Feb 08, 2021 at 07:03:59PM +0100, Jan Klemkow wrote:
> > > On Mon, Feb 08, 2021 at 03:42:54PM +0100, Alexander Bluhm wrote:
> > > > On Wed, Feb 03, 2021 at 11:20:04AM +0100, Claudio Jeker wrote:
> > > > > Just commit it. OK claudio@
> > > > > If people see problems we can back it out again.
> > > > 
> > > > This has a huge impact on TCP performance.
> > > > 
> > > > http://bluhm.genua.de/perform/results/2021-02-07T00%3A01%3A40Z/perform.html
> > > > 
> > > > For a single TCP connection between two OpenBSD boxes, throughput
> > > > drops by 77% from 3.1 GBit/sec to 710 MBit/sec.  But with 100
> > > > parallel connections the overall throughput increases by 5%.
> > > 
> > > For single connections our kernel is limited to sending out at most
> > > 4 TCP segments per tcp_output() call.  I don't see that, because I
> > > just measured with 10 and 30 streams in parallel.
> > > 
> > > FreeBSD disabled it 20 years ago.
> > > https://github.com/freebsd/freebsd-src/commit/d912c694ee00de5ea0f46743295a0fc603cab562
> > 
> > TCP_MAXBURST was added together with SACK in rev 1.12 of tcp_output.c
> > to our code base.
> > 
> > 
> > revision 1.12
> > date: 1998/11/17 19:23:02;  author: provos;  state: Exp;  lines: +239 -14;
> > NewReno, SACK and FACK support for TCP, adapted from code for BSDI
> > by Hari Balakrishnan (h...@lcs.mit.edu), Tom Henderson 
> > (t...@cs.berkeley.edu)
> > and Venkat Padmanabhan (padma...@cs.berkeley.edu) as part of the
> > Daedalus research group at the University of California,
> > (http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
> > at the Center for Information Technology Integration (citi.umich.edu)]
> > 
> > 
> > > I would suggest removing the whole feature.
> > 
> > Sending 4 segments per call to tcp_output() cannot scale.  Bandwidth
> > increases and window sizes grow, but the segment size has stayed at
> > 1500 bytes for decades.
> > 
> > With this diff on top of jan's delayed ACK behavior I get 4.1 GBit/sec
> > over a single TCP connection using tcpbench -S100.  Before both
> > changes it was only 3.0 GBit/sec.
> > 
> > I recommend removing TCP_MAXBURST like FreeBSD did.
> > 
> 
> I agree that this maxburst limit is no longer adequate. TCP New Reno
> RFC6582 has the following:
> 
>In Section 3.2, step 3 above, it is noted that implementations should
>take measures to avoid a possible burst of data when leaving fast
>recovery, in case the amount of new data that the sender is eligible
>to send due to the new value of the congestion window is large.  This
>can arise during NewReno when ACKs are lost or treated as pure window
>updates, thereby causing the sender to underestimate the number of
>new segments that can be sent during the recovery procedure.
>Specifically, bursts can occur when the FlightSize is much less than
>the new congestion window when exiting from fast recovery.  One
>simple mechanism to avoid a burst of data when leaving fast recovery
>is to limit the number of data packets that can be sent in response
>to a single acknowledgment.  (This is known as "maxburst_" in ns-2
>[NS].)  Other possible mechanisms for avoiding bursts include rate-
>based pacing, or setting the slow start threshold to the resultant
>congestion window and then resetting the congestion window to
>FlightSize.  A recommendation on the general mechanism to avoid
>excessively bursty sending patterns is outside the scope of this
>document.
> 
> While I agree that bursts need to be limited, I think the implementation
> of TCP_MAXBURST is bad. Since FreeBSD removed the code, I guess nobody
> really ran into problems with additional packet loss caused by the
> bursts. So go ahead and remove it. OK claudio@

that makes sense.  ok deraadt



Re: diff: tcp ack improvement

2021-02-08 Thread Claudio Jeker
On Mon, Feb 08, 2021 at 07:46:46PM +0100, Alexander Bluhm wrote:
> On Mon, Feb 08, 2021 at 07:03:59PM +0100, Jan Klemkow wrote:
> > On Mon, Feb 08, 2021 at 03:42:54PM +0100, Alexander Bluhm wrote:
> > > On Wed, Feb 03, 2021 at 11:20:04AM +0100, Claudio Jeker wrote:
> > > > Just commit it. OK claudio@
> > > > If people see problems we can back it out again.
> > > 
> > > This has a huge impact on TCP performance.
> > > 
> > > http://bluhm.genua.de/perform/results/2021-02-07T00%3A01%3A40Z/perform.html
> > > 
> > > For a single TCP connection between two OpenBSD boxes, throughput
> > > drops by 77% from 3.1 GBit/sec to 710 MBit/sec.  But with 100
> > > parallel connections the overall throughput increases by 5%.
> > 
> > For single connections our kernel is limited to sending out at most
> > 4 TCP segments per tcp_output() call.  I don't see that, because I
> > just measured with 10 and 30 streams in parallel.
> > 
> > FreeBSD disabled it 20 years ago.
> > https://github.com/freebsd/freebsd-src/commit/d912c694ee00de5ea0f46743295a0fc603cab562
> 
> TCP_MAXBURST was added together with SACK in rev 1.12 of tcp_output.c
> to our code base.
> 
> 
> revision 1.12
> date: 1998/11/17 19:23:02;  author: provos;  state: Exp;  lines: +239 -14;
> NewReno, SACK and FACK support for TCP, adapted from code for BSDI
> by Hari Balakrishnan (h...@lcs.mit.edu), Tom Henderson (t...@cs.berkeley.edu)
> and Venkat Padmanabhan (padma...@cs.berkeley.edu) as part of the
> Daedalus research group at the University of California,
> (http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
> at the Center for Information Technology Integration (citi.umich.edu)]
> 
> 
> > I would suggest removing the whole feature.
> 
> Sending 4 segments per call to tcp_output() cannot scale.  Bandwidth
> increases and window sizes grow, but the segment size has stayed at
> 1500 bytes for decades.
> 
> With this diff on top of jan's delayed ACK behavior I get 4.1 GBit/sec
> over a single TCP connection using tcpbench -S100.  Before both
> changes it was only 3.0 GBit/sec.
> 
> I recommend removing TCP_MAXBURST like FreeBSD did.
> 

I agree that this maxburst limit is no longer adequate. TCP New Reno
RFC6582 has the following:

   In Section 3.2, step 3 above, it is noted that implementations should
   take measures to avoid a possible burst of data when leaving fast
   recovery, in case the amount of new data that the sender is eligible
   to send due to the new value of the congestion window is large.  This
   can arise during NewReno when ACKs are lost or treated as pure window
   updates, thereby causing the sender to underestimate the number of
   new segments that can be sent during the recovery procedure.
   Specifically, bursts can occur when the FlightSize is much less than
   the new congestion window when exiting from fast recovery.  One
   simple mechanism to avoid a burst of data when leaving fast recovery
   is to limit the number of data packets that can be sent in response
   to a single acknowledgment.  (This is known as "maxburst_" in ns-2
   [NS].)  Other possible mechanisms for avoiding bursts include rate-
   based pacing, or setting the slow start threshold to the resultant
   congestion window and then resetting the congestion window to
   FlightSize.  A recommendation on the general mechanism to avoid
   excessively bursty sending patterns is outside the scope of this
   document.

While I agree that bursts need to be limited, I think the implementation
of TCP_MAXBURST is bad. Since FreeBSD removed the code, I guess nobody
really ran into problems with additional packet loss caused by the
bursts. So go ahead and remove it. OK claudio@
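
For illustration, one of the alternatives the RFC text above mentions
can be sketched like this (a userland sketch with made-up names, not a
proposed diff against our stack):

/*
 * RFC 6582 burst-avoidance alternative, sketched: on leaving fast
 * recovery, remember the computed window in ssthresh and collapse
 * cwnd to the amount of data actually in flight, so the window
 * reopens ACK by ACK instead of in one burst.  Types and field names
 * are invented for the example.
 */
#include <stdint.h>

struct cc_state {
	uint32_t snd_cwnd;	/* congestion window, bytes */
	uint32_t snd_ssthresh;	/* slow start threshold, bytes */
};

static void
leave_fast_recovery(struct cc_state *cc, uint32_t flightsize)
{
	cc->snd_ssthresh = cc->snd_cwnd;
	if (flightsize < cc->snd_cwnd)
		cc->snd_cwnd = flightsize;
}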

-- 
:wq Claudio



Re: diff: tcp ack improvement

2021-02-08 Thread Alexander Bluhm
On Mon, Feb 08, 2021 at 07:03:59PM +0100, Jan Klemkow wrote:
> On Mon, Feb 08, 2021 at 03:42:54PM +0100, Alexander Bluhm wrote:
> > On Wed, Feb 03, 2021 at 11:20:04AM +0100, Claudio Jeker wrote:
> > > Just commit it. OK claudio@
> > > If people see problems we can back it out again.
> > 
> > This has a huge impact on TCP performance.
> > 
> > http://bluhm.genua.de/perform/results/2021-02-07T00%3A01%3A40Z/perform.html
> > 
> > For a single TCP connection between two OpenBSD boxes, throughput
> > drops by 77% from 3.1 GBit/sec to 710 MBit/sec.  But with 100
> > parallel connections the overall throughput increases by 5%.
> 
> For single connections our kernel is limited to sending out at most
> 4 TCP segments per tcp_output() call.  I don't see that, because I
> just measured with 10 and 30 streams in parallel.
> 
> FreeBSD disabled it 20 years ago.
> https://github.com/freebsd/freebsd-src/commit/d912c694ee00de5ea0f46743295a0fc603cab562

TCP_MAXBURST was added together with SACK in rev 1.12 of tcp_output.c
to our code base.


revision 1.12
date: 1998/11/17 19:23:02;  author: provos;  state: Exp;  lines: +239 -14;
NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (h...@lcs.mit.edu), Tom Henderson (t...@cs.berkeley.edu)
and Venkat Padmanabhan (padma...@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


> I would suggest removing the whole feature.

Sending 4 segments per call to tcp_output() cannot scale.  Bandwidth
increases and window sizes grow, but the segment size has stayed at
1500 bytes for decades.

With this diff on top of jan's delayed ACK behavior I get 4.1 GBit/sec
over a single TCP connection using tcpbench -S100.  Before both
changes it was only 3.0 GBit/sec.

I recommend removing TCP_MAXBURST like FreeBSD did.

bluhm

> Index: tcp.h
> ===
> RCS file: /cvs/src/sys/netinet/tcp.h,v
> retrieving revision 1.21
> diff -u -p -r1.21 tcp.h
> --- tcp.h 10 Jul 2019 18:45:31 -  1.21
> +++ tcp.h 8 Feb 2021 17:52:38 -
> @@ -105,8 +105,6 @@ struct tcphdr {
>  #define  TCP_MAX_SACK3   /* Max # SACKs sent in any segment */
>  #define  TCP_SACKHOLE_LIMIT 128  /* Max # SACK holes per connection */
>  
> -#define  TCP_MAXBURST4   /* Max # packets after leaving Fast Rxmit */
> -
>  /*
>   * Default maximum segment size for TCP.
>   * With an IP MSS of 576, this is 536,
> Index: tcp_output.c
> ===
> RCS file: /cvs/src/sys/netinet/tcp_output.c,v
> retrieving revision 1.129
> diff -u -p -r1.129 tcp_output.c
> --- tcp_output.c  25 Jan 2021 03:40:46 -  1.129
> +++ tcp_output.c  8 Feb 2021 17:53:07 -
> @@ -203,7 +203,6 @@ tcp_output(struct tcpcb *tp)
>   int idle, sendalot = 0;
>   int i, sack_rxmit = 0;
>   struct sackhole *p;
> - int maxburst = TCP_MAXBURST;
>  #ifdef TCP_SIGNATURE
>   unsigned int sigoff;
>  #endif /* TCP_SIGNATURE */
> @@ -1120,7 +1119,7 @@ out:
>   tp->last_ack_sent = tp->rcv_nxt;
>   tp->t_flags &= ~TF_ACKNOW;
>   TCP_TIMER_DISARM(tp, TCPT_DELACK);
> - if (sendalot && --maxburst)
> + if (sendalot)
>   goto again;
>   return (0);
>  }



Re: diff: tcp ack improvement

2021-02-08 Thread Theo de Raadt
Yes it is unacceptable.

Alexander Bluhm  wrote:

> On Wed, Feb 03, 2021 at 11:20:04AM +0100, Claudio Jeker wrote:
> > Just commit it. OK claudio@
> > If people see problems we can back it out again.
> 
> This has a huge impact on TCP performance.
> 
> http://bluhm.genua.de/perform/results/2021-02-07T00%3A01%3A40Z/perform.html
> 
> For a single TCP connection between two OpenBSD boxes, throughput
> drops by 77% from 3.1 GBit/sec to 710 MBit/sec.  But with 100
> parallel connections the overall throughput increases by 5%.
> 
> Sending from Linux to OpenBSD increases by 72% from 3.5 GBit/sec
> to 6.0 GBit/sec.
> 
> Socket splicing from Linux to Linux via OpenBSD with 10 parallel
> TCP connections increases by 25% from 1.8 GBit/sec to 2.3 GBit/sec.
> 
> It seems that sending fewer ACK packets improves performance if the
> machine is limited by the CPU.  But the TCP stack of OpenBSD sends
> 77% slower if it does not receive enough ACKs.  This has no impact
> when we measure the combined throughput of many parallel connections.
> The Linux packet sending algorithm looks unaffected by our more
> delayed ACKs.
> 
> I think 77% slower between two OpenBSDs is not acceptable.
> Do others see that, too?
> 
> bluhm
> 



Re: diff: tcp ack improvement

2021-02-08 Thread Alexander Bluhm
On Wed, Feb 03, 2021 at 11:20:04AM +0100, Claudio Jeker wrote:
> Just commit it. OK claudio@
> If people see problems we can back it out again.

This has a huge impact on TCP performance.

http://bluhm.genua.de/perform/results/2021-02-07T00%3A01%3A40Z/perform.html

For a single TCP connection between two OpenBSD boxes, throughput
drops by 77% from 3.1 GBit/sec to 710 MBit/sec.  But with 100
parallel connections the overall throughput increases by 5%.

Sending from Linux to OpenBSD increases by 72% from 3.5 GBit/sec
to 6.0 GBit/sec.

Socket splicing from Linux to Linux via OpenBSD with 10 parallel
TCP connections increases by 25% from 1.8 GBit/sec to 2.3 GBit/sec.

It seems that sending fewer ACK packets improves performance if the
machine is limited by the CPU.  But the TCP stack of OpenBSD sends
77% slower if it does not receive enough ACKs.  This has no impact
when we measure the combined throughput of many parallel connections.
The Linux packet sending algorithm looks unaffected by our more
delayed ACKs.

I think 77% slower between two OpenBSDs is not acceptable.
Do others see that, too?

bluhm



Re: diff: tcp ack improvement

2021-02-03 Thread Jan Klemkow
On Tue, Jan 05, 2021 at 10:30:33AM +0100, Claudio Jeker wrote:
> On Tue, Jan 05, 2021 at 10:16:04AM +0100, Jan Klemkow wrote:
> > On Wed, Dec 23, 2020 at 11:59:13AM +, Stuart Henderson wrote:
> > > On 2020/12/17 20:50, Jan Klemkow wrote:
> > > > ping
> > > > 
> > > > On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > > > > bluhm and I made some network performance measurements and kernel
> > > > > profiling.
> > > 
> > > I've been running this on my workstation since you sent it out - lots
> > > of long-running ssh connections, hourly reposync, daily rsync of base
> > > snapshots.
> > > 
> > > I don't know enough about TCP stack behaviour to really give a meaningful
> > > OK, but certainly not seeing any problems with it.
> > 
> > Thanks, Stuart.  Has someone else tested this diff?  Or, are there some
> > opinions or objections about it?  Even bike-shedding is welcome :-)
> 
> From my memory TCP uses the ACKs on startup to increase the send window
> and so your diff could slow down the initial startup. Not sure if that
> matters actually. It can have some impact if userland reads in big blocks
> at infrequent intervals since then the ACK clock slows down.
> 
> I guess to get coverage it would be best to commit this and then monitor
> the lists for possible slowdowns.

Is there a way to commit this, or to test the diff in snapshots?

bye,
Jan
 
> > > > > Setup:Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> 
> > > > > Linux (iperf)
> > > > > 
> > > > > We figured out that the kernel uses a huge amount of processing time
> > > > > for sending ACKs to the sender on the receiving interface.  After
> > > > > receiving a data segment, we send out two ACKs.  The first one is sent
> > > > > in tcp_input() directly after receiving.  The second ACK is sent out
> > > > > after the userland or the sosplice task has read some data out of the
> > > > > socket buffer.
> > > > > 
> > > > > The first ACK in tcp_input() is sent after receiving every other data
> > > > > segment, as described in RFC 1122:
> > > > > 
> > > > >   4.2.3.2  When to Send an ACK Segment
> > > > >   A TCP SHOULD implement a delayed ACK, but an ACK should
> > > > >   not be excessively delayed; in particular, the delay
> > > > >   MUST be less than 0.5 seconds, and in a stream of
> > > > >   full-sized segments there SHOULD be an ACK for at least
> > > > >   every second segment.
> > > > > 
> > > > > This advice is based on the paper "Congestion Avoidance and Control":
> > > > > 
> > > > >   4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > > > >   The 8 KBps senders were talking to 4.3+BSD receivers
> > > > > which would delay an ack for at most one packet (because
> > > > > of an ack’s ‘clock’ role, the authors believe that the
> > > > >   minimum ack frequency should be every other packet).
> > > > > 
> > > > > Sending the first ACK (on every other packet) costs us too much
> > > > > processing time.  Thus, we run into a full socket buffer earlier.  The
> > > > > first ACK just acknowledges the received data, but does not update the
> > > > > window.  The second ACK, caused by the socket buffer reader, also
> > > > > acknowledges the data and also updates the window.  So the second ACK
> > > > > is worth much more for fast packet processing than the first one.
> > > > > 
> > > > > The performance improvement is between 33% with splicing and 20% 
> > > > > without
> > > > > splice:
> > > > > 
> > > > >   splicingrelaying
> > > > > 
> > > > >   current 3.1 GBit/s  2.6 GBit/s
> > > > >   w/o first ack   4.1 GBit/s  3.1 GBit/s
> > > > > 
> > > > > As far as I understand the implementations of other operating systems:
> > > > > Linux has implemented a custom TCP_QUICKACK socket option to turn this
> > > > > kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> > > > > using the New Reno implementation.
> > > > > 
> > > > > The following diff turns off the direct ACK on every other segment.  We
> > > > > have been running this diff in production on our own machines at genua
> > > > > and on our products for several months now.  We haven't noticed any
> > > > > problems, neither with interactive network sessions (ssh) nor with bulk
> > > > > traffic.
> > > > > 
> > > > > Another solution could be a sysctl(3) or an additional socket option,
> > > > > similar to Linux, to control this behavior per socket or system wide.
> > > > > Also, a counter to ACK every 3rd, 4th... data segment could address
> > > > > the problem.
> > > > > 
> > > > > bye,
> > > > > Jan
> > > > > 
> > > > > Index: netinet/tcp_input.c
> > > > > ===
> > > > > RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> > > > > retrieving revision 1.365
> > > > > diff -u -p -r1.365 tcp_input.c
> > > > > --- 

Re: diff: tcp ack improvement

2021-02-03 Thread Claudio Jeker
On Wed, Feb 03, 2021 at 10:56:38AM +0100, Jan Klemkow wrote:
> On Tue, Jan 05, 2021 at 10:30:33AM +0100, Claudio Jeker wrote:
> > On Tue, Jan 05, 2021 at 10:16:04AM +0100, Jan Klemkow wrote:
> > > On Wed, Dec 23, 2020 at 11:59:13AM +, Stuart Henderson wrote:
> > > > On 2020/12/17 20:50, Jan Klemkow wrote:
> > > > > ping
> > > > > 
> > > > > On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > > > > > bluhm and I made some network performance measurements and kernel
> > > > > > profiling.
> > > > 
> > > > I've been running this on my workstation since you sent it out - lots
> > > > of long-running ssh connections, hourly reposync, daily rsync of base
> > > > snapshots.
> > > > 
> > > > I don't know enough about TCP stack behaviour to really give a 
> > > > meaningful
> > > > OK, but certainly not seeing any problems with it.
> > > 
> > > Thanks, Stuart.  Has someone else tested this diff?  Or, are there some
> > > opinions or objections about it?  Even bike-shedding is welcome :-)
> > 
> > From my memory TCP uses the ACKs on startup to increase the send window
> > and so your diff could slow down the initial startup. Not sure if that
> > matters actually. It can have some impact if userland reads in big blocks
> > at infrequent intervals since then the ACK clock slows down.
> > 
> > I guess to get coverage it would be best to commit this and then monitor
> > the lists for possible slowdowns.
> 
> Is there a way to commit this, or to test the diff in snapshots?

Just commit it. OK claudio@
If people see problems we can back it out again.
 
> bye,
> Jan
>  
> > > > > > Setup:  Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> 
> > > > > > Linux (iperf)
> > > > > > 
> > > > > > We figured out that the kernel uses a huge amount of
> > > > > > processing time for sending ACKs to the sender on the
> > > > > > receiving interface.  After receiving a data segment, we
> > > > > > send out two ACKs.  The first one is sent in tcp_input()
> > > > > > directly after receiving.  The second ACK is sent out after
> > > > > > the userland or the sosplice task has read some data out of
> > > > > > the socket buffer.
> > > > > > 
> > > > > > The first ACK in tcp_input() is sent after receiving every
> > > > > > other data segment, as described in RFC 1122:
> > > > > > 
> > > > > > 4.2.3.2  When to Send an ACK Segment
> > > > > > A TCP SHOULD implement a delayed ACK, but an ACK should
> > > > > > not be excessively delayed; in particular, the delay
> > > > > > MUST be less than 0.5 seconds, and in a stream of
> > > > > > full-sized segments there SHOULD be an ACK for at least
> > > > > > every second segment.
> > > > > > 
> > > > > > This advice is based on the paper "Congestion Avoidance and 
> > > > > > Control":
> > > > > > 
> > > > > > 4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > > > > > The 8 KBps senders were talking to 4.3+BSD receivers
> > > > > > which would delay an ack for at most one packet (because
> > > > > > of an ack’s ‘clock’ role, the authors believe that the
> > > > > > minimum ack frequency should be every other packet).
> > > > > > 
> > > > > > Sending the first ACK (on every other packet) costs us too
> > > > > > much processing time.  Thus, we run into a full socket
> > > > > > buffer earlier.  The first ACK just acknowledges the
> > > > > > received data, but does not update the window.  The second
> > > > > > ACK, caused by the socket buffer reader, also acknowledges
> > > > > > the data and also updates the window.  So the second ACK is
> > > > > > worth much more for fast packet processing than the first
> > > > > > one.
> > > > > > 
> > > > > > The performance improvement is between 33% with splicing and 20% 
> > > > > > without
> > > > > > splice:
> > > > > > 
> > > > > > splicingrelaying
> > > > > > 
> > > > > > current 3.1 GBit/s  2.6 GBit/s
> > > > > > w/o first ack   4.1 GBit/s  3.1 GBit/s
> > > > > > 
> > > > > > As far as I understand the implementations of other
> > > > > > operating systems: Linux has implemented a custom
> > > > > > TCP_QUICKACK socket option to turn this kind of feature on
> > > > > > and off.  FreeBSD and NetBSD still depend on it when using
> > > > > > the New Reno implementation.
> > > > > > 
> > > > > > The following diff turns off the direct ACK on every other
> > > > > > segment.  We have been running this diff in production on
> > > > > > our own machines at genua and on our products for several
> > > > > > months now.  We haven't noticed any problems, neither with
> > > > > > interactive network sessions (ssh) nor with bulk traffic.
> > > > > > 
> > > > > > Another solution could be a sysctl(3) or an additional socket 
> > > > > > option,
> > > > > > similar to Linux, to control this behavior per socket or 

Re: diff: tcp ack improvement

2021-01-05 Thread Jan Klemkow
On Wed, Dec 23, 2020 at 11:59:13AM +, Stuart Henderson wrote:
> On 2020/12/17 20:50, Jan Klemkow wrote:
> > ping
> > 
> > On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > > bluhm and I made some network performance measurements and kernel
> > > profiling.
> 
> I've been running this on my workstation since you sent it out - lots
> of long-running ssh connections, hourly reposync, daily rsync of base
> snapshots.
> 
> I don't know enough about TCP stack behaviour to really give a meaningful
> OK, but certainly not seeing any problems with it.

Thanks, Stuart.  Has someone else tested this diff?  Or, are there some
opinions or objections about it?  Even bike-shedding is welcome :-)

Thanks,
Jan

> > > Setup:Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> > > 
> > > We figured out that the kernel uses a huge amount of processing time
> > > for sending ACKs to the sender on the receiving interface.  After
> > > receiving a data segment, we send out two ACKs.  The first one is sent
> > > in tcp_input() directly after receiving.  The second ACK is sent out
> > > after the userland or the sosplice task has read some data out of the
> > > socket buffer.
> > > 
> > > The first ACK in tcp_input() is sent after receiving every other data
> > > segment, as described in RFC 1122:
> > > 
> > >   4.2.3.2  When to Send an ACK Segment
> > >   A TCP SHOULD implement a delayed ACK, but an ACK should
> > >   not be excessively delayed; in particular, the delay
> > >   MUST be less than 0.5 seconds, and in a stream of
> > >   full-sized segments there SHOULD be an ACK for at least
> > >   every second segment.
> > > 
> > > This advice is based on the paper "Congestion Avoidance and Control":
> > > 
> > >   4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > >   The 8 KBps senders were talking to 4.3+BSD receivers
> > > which would delay an ack for at most one packet (because
> > > of an ack’s ‘clock’ role, the authors believe that the
> > >   minimum ack frequency should be every other packet).
> > > 
> > > Sending the first ACK (on every other packet) costs us too much
> > > processing time.  Thus, we run into a full socket buffer earlier.  The
> > > first ACK just acknowledges the received data, but does not update the
> > > window.  The second ACK, caused by the socket buffer reader, also
> > > acknowledges the data and also updates the window.  So the second ACK
> > > is worth much more for fast packet processing than the first one.
> > > 
> > > The performance improvement is between 33% with splicing and 20% without
> > > splice:
> > > 
> > >   splicingrelaying
> > > 
> > >   current 3.1 GBit/s  2.6 GBit/s
> > >   w/o first ack   4.1 GBit/s  3.1 GBit/s
> > > 
> > > As far as I understand the implementations of other operating systems:
> > > Linux has implemented a custom TCP_QUICKACK socket option to turn this
> > > kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> > > using the New Reno implementation.
> > > 
> > > The following diff turns off the direct ACK on every other segment.  We
> > > have been running this diff in production on our own machines at genua
> > > and on our products for several months now.  We haven't noticed any
> > > problems, neither with interactive network sessions (ssh) nor with bulk
> > > traffic.
> > > 
> > > Another solution could be a sysctl(3) or an additional socket option,
> > > similar to Linux, to control this behavior per socket or system wide.
> > > Also, a counter to ACK every 3rd, 4th... data segment could address
> > > the problem.
> > > 
> > > bye,
> > > Jan
> > > 
> > > Index: netinet/tcp_input.c
> > > ===
> > > RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> > > retrieving revision 1.365
> > > diff -u -p -r1.365 tcp_input.c
> > > --- netinet/tcp_input.c   19 Jun 2020 22:47:22 -  1.365
> > > +++ netinet/tcp_input.c   5 Nov 2020 23:00:34 -
> > > @@ -165,8 +165,8 @@ do { \
> > >  #endif
> > >  
> > >  /*
> > > - * Macro to compute ACK transmission behavior.  Delay the ACK unless
> > > - * we have already delayed an ACK (must send an ACK every two segments).
> > > + * Macro to compute ACK transmission behavior.  Delay the ACK until
> > > + * a read from the socket buffer or the delayed ACK timer causes one.
> > >   * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
> > >   * option is enabled or when the packet is coming from a loopback
> > >   * interface.
> > > @@ -176,8 +176,7 @@ do { \
> > >   struct ifnet *ifp = NULL; \
> > >   if (m && (m->m_flags & M_PKTHDR)) \
> > >   ifp = if_get(m->m_pkthdr.ph_ifidx); \
> > > - if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
> > > - (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > > + if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > >   (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
> > 

Re: diff: tcp ack improvement

2021-01-05 Thread Claudio Jeker
On Tue, Jan 05, 2021 at 10:16:04AM +0100, Jan Klemkow wrote:
> On Wed, Dec 23, 2020 at 11:59:13AM +, Stuart Henderson wrote:
> > On 2020/12/17 20:50, Jan Klemkow wrote:
> > > ping
> > > 
> > > On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > > > bluhm and I made some network performance measurements and kernel
> > > > profiling.
> > 
> > I've been running this on my workstation since you sent it out - lots
> > of long-running ssh connections, hourly reposync, daily rsync of base
> > snapshots.
> > 
> > I don't know enough about TCP stack behaviour to really give a meaningful
> > OK, but certainly not seeing any problems with it.
> 
> Thanks, Stuart.  Has someone else tested this diff?  Or, are there some
> opinions or objections about it?  Even bike-shedding is welcome :-)

From my memory TCP uses the ACKs on startup to increase the send window
and so your diff could slow down the initial startup. Not sure if that
matters actually. It can have some impact if userland reads in big blocks
at infrequent intervals since then the ACK clock slows down.

I guess to get coverage it would be best to commit this and then monitor
the lists for possible slowdowns.
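
To spell out the start-up concern: during slow start cwnd grows per
received ACK, roughly as in this sketch (standard RFC 5681 arithmetic
for illustration, not OpenBSD code):

/*
 * Slow-start growth, simplified: cwnd grows by one MSS per ACK that
 * arrives, so ACKing every segment roughly doubles cwnd per RTT,
 * ACKing every other segment grows it by about 1.5x per RTT, and
 * ACKing only when the application reads grows it more slowly still.
 */
#include <stdint.h>

static uint32_t
slowstart_grow(uint32_t cwnd, uint32_t mss, unsigned int acks_per_rtt)
{
	while (acks_per_rtt-- > 0)
		cwnd += mss;		/* one increment per received ACK */
	return cwnd;
}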
 
> Thanks,
> Jan
> 
> > > > Setup:  Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> > > > 
> > > > We figured out that the kernel uses a huge amount of processing time
> > > > for sending ACKs to the sender on the receiving interface.  After
> > > > receiving a data segment, we send out two ACKs.  The first one is sent
> > > > in tcp_input() directly after receiving.  The second ACK is sent out
> > > > after the userland or the sosplice task has read some data out of the
> > > > socket buffer.
> > > > 
> > > > The first ACK in tcp_input() is sent after receiving every other data
> > > > segment, as described in RFC 1122:
> > > > 
> > > > 4.2.3.2  When to Send an ACK Segment
> > > > A TCP SHOULD implement a delayed ACK, but an ACK should
> > > > not be excessively delayed; in particular, the delay
> > > > MUST be less than 0.5 seconds, and in a stream of
> > > > full-sized segments there SHOULD be an ACK for at least
> > > > every second segment.
> > > > 
> > > > This advice is based on the paper "Congestion Avoidance and Control":
> > > > 
> > > > 4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > > > The 8 KBps senders were talking to 4.3+BSD receivers
> > > > which would delay an ack for at most one packet (because
> > > > of an ack’s ‘clock’ role, the authors believe that the
> > > > minimum ack frequency should be every other packet).
> > > > 
> > > > Sending the first ACK (on every other packet) costs us too much
> > > > processing time.  Thus, we run into a full socket buffer earlier.  The
> > > > first ACK just acknowledges the received data, but does not update the
> > > > window.  The second ACK, caused by the socket buffer reader, also
> > > > acknowledges the data and also updates the window.  So the second ACK
> > > > is worth much more for fast packet processing than the first one.
> > > > 
> > > > The performance improvement is between 33% with splicing and 20% without
> > > > splice:
> > > > 
> > > > splicingrelaying
> > > > 
> > > > current 3.1 GBit/s  2.6 GBit/s
> > > > w/o first ack   4.1 GBit/s  3.1 GBit/s
> > > > 
> > > > As far as I understand the implementations of other operating systems:
> > > > Linux has implemented a custom TCP_QUICKACK socket option to turn this
> > > > kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> > > > using the New Reno implementation.
> > > > 
> > > > The following diff turns off the direct ACK on every other segment.  We
> > > > have been running this diff in production on our own machines at genua
> > > > and on our products for several months now.  We haven't noticed any
> > > > problems, neither with interactive network sessions (ssh) nor with bulk
> > > > traffic.
> > > > 
> > > > Another solution could be a sysctl(3) or an additional socket option,
> > > > similar to Linux, to control this behavior per socket or system wide.
> > > > Also, a counter to ACK every 3rd, 4th... data segment could address
> > > > the problem.
> > > > 
> > > > bye,
> > > > Jan
> > > > 
> > > > Index: netinet/tcp_input.c
> > > > ===
> > > > RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> > > > retrieving revision 1.365
> > > > diff -u -p -r1.365 tcp_input.c
> > > > --- netinet/tcp_input.c 19 Jun 2020 22:47:22 -  1.365
> > > > +++ netinet/tcp_input.c 5 Nov 2020 23:00:34 -
> > > > @@ -165,8 +165,8 @@ do { \
> > > >  #endif
> > > >  
> > > >  /*
> > > > - * Macro to compute ACK transmission behavior.  Delay the ACK unless
> > > > - * we have already delayed an ACK (must send an ACK every two 
> > > > 

Re: diff: tcp ack improvement

2020-12-23 Thread Stuart Henderson
On 2020/12/17 20:50, Jan Klemkow wrote:
> ping
> 
> On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > Hi,
> > 
> > bluhm and I made some network performance measurements and kernel
> > profiling.

I've been running this on my workstation since you sent it out - lots
of long-running ssh connections, hourly reposync, daily rsync of base
snapshots.

I don't know enough about TCP stack behaviour to really give a meaningful
OK, but certainly not seeing any problems with it.

> > Setup:  Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> > 
> > We figured out that the kernel uses a huge amount of processing time
> > for sending ACKs to the sender on the receiving interface.  After
> > receiving a data segment, we send out two ACKs.  The first one is sent
> > in tcp_input() directly after receiving.  The second ACK is sent out
> > after the userland or the sosplice task has read some data out of the
> > socket buffer.
> > 
> > The first ACK in tcp_input() is sent after receiving every other data
> > segment, as described in RFC 1122:
> > 
> > 4.2.3.2  When to Send an ACK Segment
> > A TCP SHOULD implement a delayed ACK, but an ACK should
> > not be excessively delayed; in particular, the delay
> > MUST be less than 0.5 seconds, and in a stream of
> > full-sized segments there SHOULD be an ACK for at least
> > every second segment.
> > 
> > This advice is based on the paper "Congestion Avoidance and Control":
> > 
> > 4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > The 8 KBps senders were talking to 4.3+BSD receivers
> > which would delay an ack for at most one packet (because
> > of an ack’s ‘clock’ role, the authors believe that the
> > minimum ack frequency should be every other packet).
> > 
> > Sending the first ACK (on every other packet) costs us too much
> > processing time.  Thus, we run into a full socket buffer earlier.  The
> > first ACK just acknowledges the received data, but does not update the
> > window.  The second ACK, caused by the socket buffer reader, also
> > acknowledges the data and also updates the window.  So the second ACK
> > is worth much more for fast packet processing than the first one.
> > 
> > The performance improvement is between 33% with splicing and 20% without
> > splice:
> > 
> > splicingrelaying
> > 
> > current 3.1 GBit/s  2.6 GBit/s
> > w/o first ack   4.1 GBit/s  3.1 GBit/s
> > 
> > As far as I understand the implementations of other operating systems:
> > Linux has implemented a custom TCP_QUICKACK socket option to turn this
> > kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> > using the New Reno implementation.
> > 
> > The following diff turns off the direct ACK on every other segment.  We
> > have been running this diff in production on our own machines at genua
> > and on our products for several months now.  We haven't noticed any
> > problems, neither with interactive network sessions (ssh) nor with bulk
> > traffic.
> > 
> > Another solution could be a sysctl(3) or an additional socket option,
> > similar to Linux, to control this behavior per socket or system wide.
> > Also, a counter to ACK every 3rd, 4th... data segment could address
> > the problem.
> > 
> > bye,
> > Jan
> > 
> > Index: netinet/tcp_input.c
> > ===
> > RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> > retrieving revision 1.365
> > diff -u -p -r1.365 tcp_input.c
> > --- netinet/tcp_input.c 19 Jun 2020 22:47:22 -  1.365
> > +++ netinet/tcp_input.c 5 Nov 2020 23:00:34 -
> > @@ -165,8 +165,8 @@ do { \
> >  #endif
> >  
> >  /*
> > - * Macro to compute ACK transmission behavior.  Delay the ACK unless
> > - * we have already delayed an ACK (must send an ACK every two segments).
> > + * Macro to compute ACK transmission behavior.  Delay the ACK until
> > + * a read from the socket buffer or the delayed ACK timer causes one.
> >   * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
> >   * option is enabled or when the packet is coming from a loopback
> >   * interface.
> > @@ -176,8 +176,7 @@ do { \
> > struct ifnet *ifp = NULL; \
> > if (m && (m->m_flags & M_PKTHDR)) \
> > ifp = if_get(m->m_pkthdr.ph_ifidx); \
> > -   if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
> > -   (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > +   if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
> > tp->t_flags |= TF_ACKNOW; \
> > else \
> > 
> 



Re: diff: tcp ack improvement

2020-12-17 Thread Jan Klemkow
ping

On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> Hi,
> 
> bluhm and I made some network performance measurements and kernel
> profiling.
> 
> Setup:Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> 
> We figured out that the kernel uses a huge amount of processing time
> for sending ACKs to the sender on the receiving interface.  After
> receiving a data segment, we send out two ACKs.  The first one is sent
> in tcp_input() directly after receiving.  The second ACK is sent out
> after the userland or the sosplice task has read some data out of the
> socket buffer.
> 
> The first ACK in tcp_input() is sent after receiving every other data
> segment, as described in RFC 1122:
> 
>   4.2.3.2  When to Send an ACK Segment
>   A TCP SHOULD implement a delayed ACK, but an ACK should
>   not be excessively delayed; in particular, the delay
>   MUST be less than 0.5 seconds, and in a stream of
>   full-sized segments there SHOULD be an ACK for at least
>   every second segment.
> 
> This advice is based on the paper "Congestion Avoidance and Control":
> 
>   4 THE GATEWAY SIDE OF CONGESTION CONTROL
>   The 8 KBps senders were talking to 4.3+BSD receivers
> which would delay an ack for at most one packet (because
> of an ack’s ‘clock’ role, the authors believe that the
>   minimum ack frequency should be every other packet).
> 
> Sending the first ACK (on every other packet) costs us too much
> processing time.  Thus, we run into a full socket buffer earlier.  The
> first ACK just acknowledges the received data, but does not update the
> window.  The second ACK, caused by the socket buffer reader, also
> acknowledges the data and also updates the window.  So the second ACK
> is worth much more for fast packet processing than the first one.
> 
> The performance improvement is between 33% with splicing and 20% without
> splice:
> 
>   splicingrelaying
> 
>   current 3.1 GBit/s  2.6 GBit/s
>   w/o first ack   4.1 GBit/s  3.1 GBit/s
> 
> As far as I understand the implementations of other operating systems:
> Linux has implemented a custom TCP_QUICKACK socket option to turn this
> kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> using the New Reno implementation.
> 
> The following diff turns off the direct ACK on every other segment.  We
> have been running this diff in production on our own machines at genua
> and on our products for several months now.  We haven't noticed any
> problems, neither with interactive network sessions (ssh) nor with bulk
> traffic.
> 
> Another solution could be a sysctl(3) or an additional socket option,
> similar to Linux, to control this behavior per socket or system wide.
> Also, a counter to ACK every 3rd, 4th... data segment could address
> the problem.
> 
> bye,
> Jan
> 
> Index: netinet/tcp_input.c
> ===
> RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> retrieving revision 1.365
> diff -u -p -r1.365 tcp_input.c
> --- netinet/tcp_input.c   19 Jun 2020 22:47:22 -  1.365
> +++ netinet/tcp_input.c   5 Nov 2020 23:00:34 -
> @@ -165,8 +165,8 @@ do { \
>  #endif
>  
>  /*
> - * Macro to compute ACK transmission behavior.  Delay the ACK unless
> - * we have already delayed an ACK (must send an ACK every two segments).
> + * Macro to compute ACK transmission behavior.  Delay the ACK until
> + * a read from the socket buffer or the delayed ACK timer causes one.
>   * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
>   * option is enabled or when the packet is coming from a loopback
>   * interface.
> @@ -176,8 +176,7 @@ do { \
>   struct ifnet *ifp = NULL; \
>   if (m && (m->m_flags & M_PKTHDR)) \
>   ifp = if_get(m->m_pkthdr.ph_ifidx); \
> - if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
> - (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> + if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
>   (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
>   tp->t_flags |= TF_ACKNOW; \
>   else \
> 
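
As a side note on the TCP_QUICKACK option mentioned above: on Linux it
is toggled per socket roughly like this (illustrative only; the option
is Linux-specific and not sticky, and OpenBSD has no equivalent knob):

/*
 * Linux-only sketch: TCP_QUICKACK disables/enables delayed ACKs on a
 * socket, but the kernel may clear it again, so programs that want
 * quick ACKs re-enable it after reads.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int
set_quickack(int fd, int on)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
}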



Re: diff: tcp ack improvement

2020-11-06 Thread Jan Klemkow
On Fri, Nov 06, 2020 at 08:03:36AM +0100, Otto Moerbeek wrote:
> On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:
> > bluhm and I made some network performance measurements and kernel
> > profiling.
> > 
> > Setup:  Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> > 
> > We figured out that the kernel uses a huge amount of processing time
> > for sending ACKs to the sender on the receiving interface.  After
> > receiving a data segment, we send out two ACKs.  The first one is sent
> > in tcp_input() directly after receiving.  The second ACK is sent out
> > after the userland or the sosplice task has read some data out of the
> > socket buffer.
> > 
> > The first ACK in tcp_input() is sent after receiving every other data
> > segment, as described in RFC 1122:
> > 
> > 4.2.3.2  When to Send an ACK Segment
> > A TCP SHOULD implement a delayed ACK, but an ACK should
> > not be excessively delayed; in particular, the delay
> > MUST be less than 0.5 seconds, and in a stream of
> > full-sized segments there SHOULD be an ACK for at least
> > every second segment.
> > 
> > This advice is based on the paper "Congestion Avoidance and Control":
> > 
> > 4 THE GATEWAY SIDE OF CONGESTION CONTROL
> > The 8 KBps senders were talking to 4.3+BSD receivers
> > which would delay an ack for at most one packet (because
> > of an ack’s ‘clock’ role, the authors believe that the
> > minimum ack frequency should be every other packet).
> > 
> > Sending the first ACK (on every other packet) costs us too much
> > processing time.  Thus, we run into a full socket buffer earlier.  The
> > first ACK just acknowledges the received data, but does not update the
> > window.  The second ACK, caused by the socket buffer reader, also
> > acknowledges the data and also updates the window.  So the second ACK
> > is worth much more for fast packet processing than the first one.
> > 
> > The performance improvement is between 33% with splicing and 20% without
> > splice:
> > 
> > splicingrelaying
> > 
> > current 3.1 GBit/s  2.6 GBit/s
> > w/o first ack   4.1 GBit/s  3.1 GBit/s
> > 
> > As far as I understand the implementations of other operating systems:
> > Linux has implemented a custom TCP_QUICKACK socket option to turn this
> > kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> > using the New Reno implementation.
> > 
> > The following diff turns off the direct ACK on every other segment.  We
> > have been running this diff in production on our own machines at genua
> > and on our products for several months now.  We haven't noticed any
> > problems, neither with interactive network sessions (ssh) nor with bulk
> > traffic.
> > 
> > Another solution could be a sysctl(3) or an additional socket option,
> > similar to Linux, to control this behavior per socket or system wide.
> > Also, a counter to ACK every 3rd, 4th... data segment could address
> > the problem.
> 
> I am wondering if you also looked at another scenario: the process
> reading the socket is sleeping so the receive buffer fills up without
> any acks being sent. Won't that lead to a lot of retransmissions
> containing data?

No, an ACK will always be sent out after the delayed ACK timer is
triggered.  So it shouldn't be a problem when nobody on the system
reads from the socket buffer.
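
Simplified, that fallback looks like the classic BSD delayed ACK timer
handler (a sketch with invented names, not a copy of our tcp_timer.c):

/*
 * When the delayed ACK timer fires, the connection is flagged to ACK
 * immediately and output is attempted, so the receiver never stays
 * silent even if nothing reads from the socket buffer.
 */
#include <stdint.h>

#define TF_ACKNOW	0x0001		/* send an ACK at once */

struct tcpcb_sketch {
	uint32_t t_flags;
};

static void
tcp_output_sketch(struct tcpcb_sketch *tp)
{
	/* build and send a segment carrying the ACK (omitted) */
	(void)tp;
}

static void
delack_timer_fires(struct tcpcb_sketch *tp)
{
	tp->t_flags |= TF_ACKNOW;	/* force an ACK ... */
	tcp_output_sketch(tp);		/* ... on this output attempt */
}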

> > Index: netinet/tcp_input.c
> > ===
> > RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> > retrieving revision 1.365
> > diff -u -p -r1.365 tcp_input.c
> > --- netinet/tcp_input.c 19 Jun 2020 22:47:22 -  1.365
> > +++ netinet/tcp_input.c 5 Nov 2020 23:00:34 -
> > @@ -165,8 +165,8 @@ do { \
> >  #endif
> >  
> >  /*
> > - * Macro to compute ACK transmission behavior.  Delay the ACK unless
> > - * we have already delayed an ACK (must send an ACK every two segments).
> > + * Macro to compute ACK transmission behavior.  Delay the ACK until
> > + * a read from the socket buffer or the delayed ACK timer causes one.
> >   * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
> >   * option is enabled or when the packet is coming from a loopback
> >   * interface.
> > @@ -176,8 +176,7 @@ do { \
> > struct ifnet *ifp = NULL; \
> > if (m && (m->m_flags & M_PKTHDR)) \
> > ifp = if_get(m->m_pkthdr.ph_ifidx); \
> > -   if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
> > -   (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > +   if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> > (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
> > tp->t_flags |= TF_ACKNOW; \
> > else \
> > 
> 



Re: diff: tcp ack improvement

2020-11-05 Thread Otto Moerbeek
On Fri, Nov 06, 2020 at 01:10:52AM +0100, Jan Klemkow wrote:

> Hi,
> 
> bluhm and I made some network performance measurements and kernel
> profiling.
> 
> Setup:Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)
> 
> We figured out that the kernel uses a huge amount of processing time
> for sending ACKs to the sender on the receiving interface.  After
> receiving a data segment, we send out two ACKs.  The first one is sent
> in tcp_input() directly after receiving.  The second ACK is sent out
> after the userland or the sosplice task has read some data out of the
> socket buffer.
> 
> The first ACK in tcp_input() is sent after receiving every other data
> segment, as described in RFC 1122:
> 
>   4.2.3.2  When to Send an ACK Segment
>   A TCP SHOULD implement a delayed ACK, but an ACK should
>   not be excessively delayed; in particular, the delay
>   MUST be less than 0.5 seconds, and in a stream of
>   full-sized segments there SHOULD be an ACK for at least
>   every second segment.
> 
> This advice is based on the paper "Congestion Avoidance and Control":
> 
>   4 THE GATEWAY SIDE OF CONGESTION CONTROL
>   The 8 KBps senders were talking to 4.3+BSD receivers
> which would delay an ack for at most one packet (because
> of an ack’s ‘clock’ role, the authors believe that the
>   minimum ack frequency should be every other packet).
> 
> Sending the first ACK (on every other packet) costs us too much
> processing time.  Thus, we run into a full socket buffer earlier.  The
> first ACK just acknowledges the received data, but does not update the
> window.  The second ACK, caused by the socket buffer reader, also
> acknowledges the data and also updates the window.  So the second ACK
> is worth much more for fast packet processing than the first one.
> 
> The performance improvement is between 33% with splicing and 20% without
> splice:
> 
>   splicingrelaying
> 
>   current 3.1 GBit/s  2.6 GBit/s
>   w/o first ack   4.1 GBit/s  3.1 GBit/s
> 
> As far as I understand the implementations of other operating systems:
> Linux has implemented a custom TCP_QUICKACK socket option to turn this
> kind of feature on and off.  FreeBSD and NetBSD still depend on it when
> using the New Reno implementation.
> 
> The following diff turns off the direct ACK on every other segment.  We
> have been running this diff in production on our own machines at genua
> and on our products for several months now.  We haven't noticed any
> problems, neither with interactive network sessions (ssh) nor with bulk
> traffic.
> 
> Another solution could be a sysctl(3) or an additional socket option,
> similar to Linux, to control this behavior per socket or system wide.
> Also, a counter to ACK every 3rd, 4th... data segment could address
> the problem.

I am wondering if you also looked at another scenario: the process
reading the socket is sleeping so the receive buffer fills up without
any acks being sent. Won't that lead to a lot of retransmissions
containing data?

-Otto

> 
> bye,
> Jan
> 
> Index: netinet/tcp_input.c
> ===
> RCS file: /cvs/src/sys/netinet/tcp_input.c,v
> retrieving revision 1.365
> diff -u -p -r1.365 tcp_input.c
> --- netinet/tcp_input.c   19 Jun 2020 22:47:22 -  1.365
> +++ netinet/tcp_input.c   5 Nov 2020 23:00:34 -
> @@ -165,8 +165,8 @@ do { \
>  #endif
>  
>  /*
> - * Macro to compute ACK transmission behavior.  Delay the ACK unless
> - * we have already delayed an ACK (must send an ACK every two segments).
> + * Macro to compute ACK transmission behavior.  Delay the ACK until
> + * a read from the socket buffer or the delayed ACK timer causes one.
>   * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
>   * option is enabled or when the packet is coming from a loopback
>   * interface.
> @@ -176,8 +176,7 @@ do { \
>   struct ifnet *ifp = NULL; \
>   if (m && (m->m_flags & M_PKTHDR)) \
>   ifp = if_get(m->m_pkthdr.ph_ifidx); \
> - if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) || \
> - (tcp_ack_on_push && (tiflags) & TH_PUSH) || \
> + if ((tcp_ack_on_push && (tiflags) & TH_PUSH) || \
>   (ifp && (ifp->if_flags & IFF_LOOPBACK))) \
>   tp->t_flags |= TF_ACKNOW; \
>   else \
>