Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 flow of 1MB back-to-back RPCs, and 1 flow of 10KB back-to-back RPCs).
> The following table shows the 99% and 99.9% latencies for both Cubic
> and dctcp:
>
>              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>    1MB RPCs    2.6ms      5.5ms        43ms       208ms
>   10KB RPCs    1.1ms      1.3ms        53ms       212ms
>
> Looking at tcpdump traces revealed two causes for the
> latency:
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs: DCTCP state was not
> updated when an ACK that was initially delayed was later sent
> piggybacked on a data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>              dctcp 99%  dctcp 99.9%
>    1MB RPCs    4.8ms      6.5ms
>   10KB RPCs    143us      184us
>
> Note that while the 1MB RPC tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by
> these patches):
>
>    Host A sends the last packets of the request
>    Host B receives them, and the last packet is marked with congestion (CE)
>    Host B sends ACKs for packets not marked with congestion
>    Host B sends data packet with reply and ACK for packet marked with
>       congestion (TCP flag ECE)
>    Host A receives ACKs with no ECE flag
>    Host A receives data packet with ACK for the last packet of request
>       and which has TCP ECE bit set
>    Host A sends 1st data packet of the next request with TCP flag CWR
>    Host B receives the packet (as seen in tcpdump at B), no CE flag
>    Host B sends a dup ACK that also has the TCP ECE flag
>    Host A RTO timer fires!
>    Host A resends the packet
>    Host A receives an ACK for everything it has sent (i.e. Host B
>       did receive the 1st packet of the request)
>    Host A sends more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry!

I suspect there is a broader problem with "DCTCP-style precise ECE
ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They
violate those assumptions by rewinding tp->rcv_nxt backwards and then
calling tcp_send_ack(sk). But the existing code path to send an ACK
assumes that any ACK transmission always sends an ACK that accounts
for all received data, so that after sending an ACK we can cancel the
delayed ACK timer. So it seems with DCTCP we can have buggy sequences
where the DCTCP historical ACK causes us to forget that we need to
schedule a delayed ACK (which will force the sender to RTO, as shown
in the trace):

tcp_event_data_recv()
  inet_csk_schedule_ack()  // remember that we need an ACK
  tcp_ecn_check_ce()
    tcp_ca_event()  // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
      dctcp_cwnd_event()
        dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
          if (... && ca->delayed_ack_reserved)
            tp->rcv_nxt = ca->prior_rcv_nxt;
            tcp_send_ack(sk);  // send an ACK, but for old data!
              tcp_transmit_skb()
                tcp_event_ack_sent()
                  inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
                  // forget that we need a delayed ACK!
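
For reference, the CE-state transition in question looks roughly like
this (a simplified paraphrase of the logic, not a verbatim copy of
net/ipv4/tcp_dctcp.c):

static void dctcp_ce_state_0_to_1(struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	if (!ca->ce_state && ca->delayed_ack_reserved) {
		u32 tmp_rcv_nxt = tp->rcv_nxt;

		/* Rewind rcv_nxt and ACK the data received *before*
		 * the CE transition, to keep the ECE signal precise.
		 */
		tp->rcv_nxt = ca->prior_rcv_nxt;
		tcp_send_ack(sk);	/* also clears the delayed-ACK timer */

		tp->rcv_nxt = tmp_rcv_nxt;
	}

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->ce_state = 1;
	tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}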

AFAICT the second patch in this series, which forces an immediate ACK
on any packet with CWR, papers over this issue of the forgotten
delayed ACK by forcing an immediate ACK in __tcp_ack_snd_check(). This
ensures that at least for CWR packets the forgotten delayed ACK does
not matter. But for packets without CWR I believe the buggy "forgotten
delayed ACK" sequence above is still possible, and could still plague
DCTCP with high tail latencies in some cases. I suspect that after
your CWR-forces-ACK patch it may not be jumping out in performance
test results because (a) if there is no CWR involved then usually the
cwnd is high enough that the missing delayed ACK does not cause a
stall, and (b) if there is a CWR involved then your patch forces an
immediate ACK. But AFAICT that forgotten delayed ACK bug is still
there.
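
To make that concrete, here is a tiny userspace toy model of the
receiver-side bookkeeping (illustrative only; all names are made up
and this is not kernel code):

#include <stdbool.h>
#include <stdio.h>

struct rcvr {
	unsigned int rcv_nxt;        /* next seq we expect / owe an ACK for */
	unsigned int prior_rcv_nxt;  /* rcv_nxt before the last CE change */
	unsigned int last_ack_sent;  /* highest ACK we actually emitted */
	bool delack_armed;           /* delayed-ACK timer pending? */
};

static void send_ack(struct rcvr *r, unsigned int ack)
{
	r->last_ack_sent = ack;
	/* Mirrors the tcp_event_ack_sent() behavior above: *any* ACK
	 * clears the delayed-ACK timer, even an ACK for old data.
	 */
	r->delack_armed = false;
	printf("sent ACK %u (rcv_nxt=%u)\n", ack, r->rcv_nxt);
}

int main(void)
{
	struct rcvr r = { .rcv_nxt = 1000, .prior_rcv_nxt = 1000 };

	/* New data arrives; we owe an ACK and arm the delayed-ACK timer. */
	r.rcv_nxt = 2000;
	r.delack_armed = true;

	/* CE state flips: a "historical" ACK goes out for the old rcv_nxt. */
	send_ack(&r, r.prior_rcv_nxt);

	/* The timer is gone, so seq 2000 is never acked; the sender RTOs. */
	printf("delack_armed=%d, last_ack_sent=%u\n",
	       r.delack_armed, r.last_ack_sent);
	return 0;
}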

What do folks think?

neal


[PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-29 Thread Lawrence Brakmo
We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 flow of 1MB back-to-back RPCs, and 1 flow of 10KB back-to-back RPCs).
The following table shows the 99% and 99.9% latencies for both Cubic
and dctcp:

             Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
   1MB RPCs    2.6ms      5.5ms        43ms       208ms
  10KB RPCs    1.1ms      1.3ms        53ms       212ms

Looking at tcpdump traces revealed two causes for the
latency:

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
 the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
 pauses for the duration of the delayed ACK.
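
(For scale: these tails are consistent with Linux's 200ms minimum RTO
and its 40-200ms delayed-ACK timeout, so a single RTO per RPC is
enough to produce the ~208-212ms 99.9% latencies above.)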

The first patch fixes the cause of the dup ACKs: DCTCP state was not
updated when an ACK that was initially delayed was later sent
piggybacked on a data packet.
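
Roughly, the idea is a hook like the following (a sketch with a
hypothetical name, not the actual diff):

/* Called when an ACK that was initially delayed finally goes out,
 * whether on its own or piggybacked on data, so DCTCP can refresh
 * its bookkeeping instead of later sending a stale "historical" ACK.
 */
static void dctcp_update_ack_reserved(struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);

	ca->prior_rcv_nxt = tcp_sk(sk)->rcv_nxt;
	ca->delayed_ack_reserved = false;
}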

The second patch ensures that an ACK is sent immediately when a
CWR marked packet arrives.
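
The idea, sketched (the flag name here is illustrative; the actual
diff may differ):

static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (tcp_hdr(skb)->cwr) {
		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
		/* The sender may be down to a cwnd of 1 and stalled
		 * waiting for this ACK, so demand an immediate ACK
		 * rather than a delayed one.
		 */
		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
	}
}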

With the patches the latencies for DCTCP now look like:

             dctcp 99%  dctcp 99.9%
   1MB RPCs    4.8ms      6.5ms
  10KB RPCs    143us      184us

Note that while the 1MB RPC tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by
these patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked with
      congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of request
      and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A resends the packet
   Host A receives an ACK for everything it has sent (i.e. Host B
      did receive the 1st packet of the request)
   Host A sends more packets…

[PATCH net-next 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

 net/ipv4/tcp_input.c  | 25 +++++++++++++++++--------
 net/ipv4/tcp_output.c |  2 ++
 2 files changed, 19 insertions(+), 8 deletions(-)