Re: [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue

2018-11-28 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet  wrote:
>
> In case GRO is not as efficient as it should be or is disabled,
> we might have a user thread trapped in __release_sock() while
> the softirq handler floods packets up to the point we have to drop.
>
> This patch balances work done from the user thread and softirq,
> to give more chances to __release_sock() to complete its work
> before new packets are added to the backlog.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> This patch brings ~60% throughput increase on a receiver
> without GRO, but the spectacular gain is really the
> ~1000x release_sock() latency reduction I have measured.
>
> Signed-off-by: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---

Acked-by: Neal Cardwell 

Thanks!

neal


Re: [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()

2018-11-28 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet  wrote:
>
> Neal pointed out that non-SACK flows might suffer from the ACK compression
> added in the following patch ("tcp: implement coalescing on backlog queue").
>
> Instead of tweaking tcp_add_backlog() we can take into
> account how many ACKs were coalesced; this information
> will be available in skb_shinfo(skb)->gso_segs.
>
> Signed-off-by: Eric Dumazet 
> ---

Acked-by: Neal Cardwell 

Thanks!

neal


Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> In case GRO is not as efficient as it should be or is disabled,
> we might have a user thread trapped in __release_sock() while
> the softirq handler floods packets up to the point we have to drop.
>
> This patch balances work done from the user thread and softirq,
> to give more chances to __release_sock() to complete its work
> before new packets are added to the backlog.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> This patch brings ~60% throughput increase on a receiver
> without GRO, but the spectacular gain is really the
> ~1000x release_sock() latency reduction I have measured.
>
> Signed-off-by: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---
...
> +   if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> +   TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> +#ifdef CONFIG_TLS_DEVICE
> +   tail->decrypted != skb->decrypted ||
> +#endif
> +   thtail->doff != th->doff ||
> +   memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
> +   goto no_coalesce;
> +
> +   __skb_pull(skb, hdrlen);
> +   if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> +   thtail->window = th->window;
> +
> +   TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
> +
> +   if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
> +   TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +
> +   TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;

I wonder if technically the logic should skip coalescing if the tail
or skb has the TCP_FLAG_URG bit set? If skbs are coalesced and some
have urgent data while others do not, then the TCP_FLAG_URG bit will
be accumulated into the tail header, but there will be no way to pass
along the correct urgent offsets for the one or more skbs that carry
urgent data.
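
To make the concern concrete, the kind of extra guard being suggested
might look like the sketch below (hypothetical, not part of the posted
patch; it simply refuses to coalesce whenever either header carries URG,
so each segment's urgent pointer is delivered on its own):

	/* Hypothetical addition to tcp_add_backlog()'s coalescing checks:
	 * never merge segments when urgent data may be involved.
	 */
	if (unlikely((TCP_SKB_CB(tail)->tcp_flags | TCP_SKB_CB(skb)->tcp_flags) &
		     TCPHDR_URG))
		goto no_coalesce;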

Thinking out loud, I guess if this is ECN/DCTCP and some ACKs have
TCP_FLAG_ECE and some don't, this will effectively cause all ACKed
bytes to be treated as ECN-marked. That is probably OK: if this
coalescing path is being hit, the sender may be overloaded, and
slowing down might be a good thing.

Otherwise, looks great to me. Thanks for doing this!

neal


Re: [PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> Jean-Louis Dupond reported poor iscsi TCP receive performance
> that we tracked to backlog drops.
>
> Apparently we fail to send window updates reflecting the
> fact that we are under stress.
>
> Note that we might lack a proper window increase when
> backlog is fully processed, since __release_sock() clears
> sk->sk_backlog.len _after_ all skbs have been processed.
>
> This should not matter in practice. If we have a significant
> load going through the socket backlog, we are already in a
> dangerous situation.
>
> Reported-by: Jean-Louis Dupond 
> Signed-off-by: Eric Dumazet 
> ---
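
For reference, the idea amounts to making the receive-space calculation
subtract what is sitting in the backlog, roughly as in the sketch below
(a sketch of the concept only, not necessarily the exact hunk that was
merged):

	/* Sketch: shrink the advertised window while the socket backlog is
	 * building up, so window updates reflect that we are under stress.
	 */
	static inline int tcp_space(const struct sock *sk)
	{
		return tcp_win_from_space(sk, sk->sk_rcvbuf -
					      READ_ONCE(sk->sk_backlog.len) -
					      atomic_read(&sk->sk_rmem_alloc));
	}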

Acked-by: Neal Cardwell 

Nice. Thanks!

neal


Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> Neal pointed out that non-SACK flows might suffer from the ACK compression
> added in the following patch ("tcp: implement coalescing on backlog queue").
>
> Instead of tweaking tcp_add_backlog() we can take into
> account how many ACKs were coalesced; this information
> will be available in skb_shinfo(skb)->gso_segs.
>
> Signed-off-by: Eric Dumazet 
> ---
...
> @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
> /* A Reno DUPACK means new data in F-RTO step 2.b above are
>  * delivered. Lower inflight to clock out (re)tranmissions.
>  */
> -   if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
> -   tcp_add_reno_sack(sk);
> +   if (after(tp->snd_nxt, tp->high_seq))
> +   tcp_add_reno_sack(sk, num_dupack);
> else if (flag & FLAG_SND_UNA_ADVANCED)
> tcp_reset_reno_sack(tp);
> }

I think this probably should be checking num_dupack, something like:

+   if (after(tp->snd_nxt, tp->high_seq) && num_dupack)
+   tcp_add_reno_sack(sk, num_dupack);

If we don't check num_dupack, that seems to mean that after FRTO sends
the two new data packets (making snd_nxt after high_seq), the patch
would have a particular non-SACK FRTO loss recovery always go into the
"if" branch where we call tcp_add_reno_sack(), and we would never have
a chance to get to the "else" branch where we check
FLAG_SND_UNA_ADVANCED and zero out the reno SACKs.
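
Put together, the suggested hunk would then read something like this
(sketch only, not compiled):

	/* A Reno DUPACK means new data in F-RTO step 2.b above are
	 * delivered. Lower inflight to clock out (re)transmissions.
	 */
	if (after(tp->snd_nxt, tp->high_seq) && num_dupack)
		tcp_add_reno_sack(sk, num_dupack);
	else if (flag & FLAG_SND_UNA_ADVANCED)
		tcp_reset_reno_sack(tp);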

Otherwise the patch looks great to me. Thanks for doing this!

neal


Re: [PATCH v2 net-next 1/4] tcp: hint compiler about sack flows

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> Tell the compiler that most TCP flows are using SACK these days.
>
> There is no need to add the unlikely() clause in tcp_is_reno(),
> the compiler is able to infer it.
>
> Signed-off-by: Eric Dumazet 
> ---

Acked-by: Neal Cardwell 

Nice. Thanks!

neal


Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-22 Thread Neal Cardwell
On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet  wrote:
>
> In case GRO is not as efficient as it should be or is disabled,
> we might have a user thread trapped in __release_sock() while
> the softirq handler floods packets up to the point we have to drop.
>
> This patch balances work done from the user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.

Would this coalesce duplicate incoming ACK packets? Is there a risk
that this would eliminate incoming dupacks needed for fast recovery in
non-SACK connections? Perhaps pure ACKs should only be coalesced if
the ACK field is different?
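
To illustrate the concern, one could imagine a guard of this shape in the
backlog coalescing path (purely hypothetical sketch; the final series
instead kept a count of the coalesced ACKs in skb_shinfo(skb)->gso_segs so
the non-SACK code can still account for every dupack):

	/* Hypothetical: never merge pure ACKs that duplicate the tail's
	 * ack_seq, so a burst of dupacks still reaches tcp_ack() as
	 * distinct dupacks and can trigger non-SACK fast recovery.
	 */
	bool pure_ack = TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq;

	if (pure_ack &&
	    TCP_SKB_CB(skb)->ack_seq == TCP_SKB_CB(tail)->ack_seq)
		goto no_coalesce;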

neal


Re: [PATCH net-next 3/3] tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies

2018-11-11 Thread Neal Cardwell
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet  wrote:
>
> tcp_tso_should_defer()'s first heuristic is to not defer
> if the last send is "old enough".
>
> Its current implementation uses jiffies, which have low granularity.
>
> TSO autodefer performance should not rely on kernel HZ :/
>
> After EDT conversion, we have state variables in nanoseconds that
> can allow us to properly implement the heuristic.
>
> This patch increases TSO chunk sizes on medium rate flows,
> especially when receivers do not use GRO or similar aggregation.
>
> It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP
> behavior more uniform.
>
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> ---

Nice. Thanks!

Acked-by: Neal Cardwell 

neal


Re: [PATCH net-next 2/3] tcp: refine tcp_tso_should_defer() after EDT adoption

2018-11-11 Thread Neal Cardwell
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet  wrote:
>
> tcp_tso_should_defer()'s last step tries to check if the probable
> next ACK packet is coming in less than half an RTT.
>
> The problem is that head->tstamp might be in the future,
> so we need to use signed arithmetic to avoid overflows.
>
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> ---
>  net/ipv4/tcp_output.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
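
For context, the signed-arithmetic point boils down to something like the
sketch below (illustrative only; variable names and layout are mine, not
the actual hunk):

	/* head->tstamp now holds the head skb's earliest departure time,
	 * which can be later than the current clock; compute the age as a
	 * signed quantity so an EDT in the future does not wrap around.
	 */
	s64 age_ns = (s64)(tp->tcp_clock_cache - head->tstamp); /* may be < 0 */
	s64 half_srtt_ns = (s64)(tp->srtt_us >> 4) * NSEC_PER_USEC;

	/* Next ACK is probably more than half an RTT away: do not defer. */
	if (age_ns < half_srtt_ns)
		goto send_now;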

Thanks!

Acked-by: Neal Cardwell 

neal


Re: [PATCH net-next 1/3] tcp: do not try to defer skbs with eor mark (MSG_EOR)

2018-11-11 Thread Neal Cardwell
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet  wrote:
>
> Applications using MSG_EOR are giving a strong hint to the TCP stack:
>
> Subsequent sendmsg() calls can not append more bytes to skbs carrying
> the EOR mark.
>
> Do not try to TSO-defer such skbs; there is really no hope.
>
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> ---
>  net/ipv4/tcp_output.c | 4 
>  1 file changed, 4 insertions(+)
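
The change being described essentially boils down to an early exit of
this shape in tcp_tso_should_defer() (a sketch of the idea, not the
verbatim patch):

	/* An skb carrying the EOR mark can never have more bytes appended
	 * by later sendmsg() calls, so deferring it in the hope of building
	 * a bigger TSO packet is pointless: send it right away.
	 */
	if (TCP_SKB_CB(skb)->eor)
		goto send_now;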

Thanks!

Acked-by: Neal Cardwell 

neal


Re: [PATCH net-next] net_sched: sch_fq: add dctcp-like marking

2018-11-11 Thread Neal Cardwell
On Sun, Nov 11, 2018 at 12:11 PM Eric Dumazet  wrote:
>
> Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute")
>
> After EDT adoption, it became easier to implement DCTCP-like CE marking.
>
> In many cases, queues are not building in the network fabric but on
> the hosts themselves.
>
> If packets leaving fq missed their Earliest Departure Time by XXX usec,
> we mark them with ECN CE. This gives feedback (after one RTT) to
> the sender to slow down and find a better operating mode.
>
> Example :
>
> tc qd replace dev eth0 root fq ce_threshold 2.5ms
>
> Signed-off-by: Eric Dumazet 
> ---
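
The marking rule being described is simple to sketch (hypothetical field
and variable names, not the actual sch_fq hunk):

	/* At fq dequeue time: if the packet is leaving later than its
	 * Earliest Departure Time by more than ce_threshold, set ECN CE so
	 * the sender gets a congestion signal one RTT later.
	 */
	u64 now_ns = ktime_get_ns();

	if (now_ns > skb_edt_ns + q->ce_threshold_ns)
		INET_ECN_set_ce(skb);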

Very nice! Thanks, Eric. :-)

Acked-by: Neal Cardwell 

neal


[PATCH net-next] tcp_bbr: update comments to reflect pacing_margin_percent

2018-11-08 Thread Neal Cardwell
Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
approach of using an explicit bbr_pacing_margin_percent for shaving a
pacing rate "haircut", rather than the previous implicit
approach. Update an old comment to reflect the new approach.

Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 9277abdd822a0..0f497fc49c3fe 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -128,7 +128,12 @@ static const u32 bbr_probe_rtt_mode_ms = 200;
 /* Skip TSO below the following bandwidth (bits/sec): */
 static const int bbr_min_tso_rate = 120;
 
-/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */
+/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck.
+ * In order to help drive the network toward lower queues and low latency while
+ * maintaining high utilization, the average pacing rate aims to be slightly
+ * lower than the estimated bandwidth. This is an important aspect of the
+ * design.
+ */
 static const int bbr_pacing_margin_percent = 1;
 
 /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
@@ -247,13 +252,7 @@ static void bbr_init_pacing_rate_from_rtt(struct sock *sk)
sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain);
 }
 
-/* Pace using current bw estimate and a gain factor. In order to help drive the
- * network toward lower queues while maintaining high utilization and low
- * latency, the average pacing rate aims to be slightly (~1%) lower than the
- * estimated bandwidth. This is an important aspect of the design. In this
- * implementation this slightly lower pacing rate is achieved implicitly by not
- * including link-layer headers in the packet size used for the pacing rate.
- */
+/* Pace using current bw estimate and a gain factor. */
 static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
 {
struct tcp_sock *tp = tcp_sk(sk);
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH net-next 2/2] tcp_bbr: centralize code to set gains

2018-10-16 Thread Neal Cardwell
Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the future) dynamically change the gain values and
ensure that the correct gain values are always used.

Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Priyaranjan Jha 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 40 ++--
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 4cc2223d2cd54..9277abdd822a0 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -521,8 +521,6 @@ static void bbr_advance_cycle_phase(struct sock *sk)
 
bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
bbr->cycle_mstamp = tp->delivered_mstamp;
-   bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT :
-   bbr_pacing_gain[bbr->cycle_idx];
 }
 
 /* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
@@ -540,8 +538,6 @@ static void bbr_reset_startup_mode(struct sock *sk)
struct bbr *bbr = inet_csk_ca(sk);
 
bbr->mode = BBR_STARTUP;
-   bbr->pacing_gain = bbr_high_gain;
-   bbr->cwnd_gain   = bbr_high_gain;
 }
 
 static void bbr_reset_probe_bw_mode(struct sock *sk)
@@ -549,8 +545,6 @@ static void bbr_reset_probe_bw_mode(struct sock *sk)
struct bbr *bbr = inet_csk_ca(sk);
 
bbr->mode = BBR_PROBE_BW;
-   bbr->pacing_gain = BBR_UNIT;
-   bbr->cwnd_gain = bbr_cwnd_gain;
bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
bbr_advance_cycle_phase(sk);/* flip to next phase of gain cycle */
 }
@@ -768,8 +762,6 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 
if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
bbr->mode = BBR_DRAIN;  /* drain queue we created */
-   bbr->pacing_gain = bbr_drain_gain;  /* pace slow to drain */
-   bbr->cwnd_gain = bbr_high_gain; /* maintain cwnd */
tcp_sk(sk)->snd_ssthresh =
bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
}   /* fall through to check if in-flight is already small: */
@@ -831,8 +823,6 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
!bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
-   bbr->pacing_gain = BBR_UNIT;
-   bbr->cwnd_gain = BBR_UNIT;
bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
bbr->probe_rtt_done_stamp = 0;
}
@@ -860,6 +850,35 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
bbr->idle_restart = 0;
 }
 
+static void bbr_update_gains(struct sock *sk)
+{
+   struct bbr *bbr = inet_csk_ca(sk);
+
+   switch (bbr->mode) {
+   case BBR_STARTUP:
+   bbr->pacing_gain = bbr_high_gain;
+   bbr->cwnd_gain   = bbr_high_gain;
+   break;
+   case BBR_DRAIN:
+   bbr->pacing_gain = bbr_drain_gain;  /* slow, to drain */
+   bbr->cwnd_gain   = bbr_high_gain;   /* keep cwnd */
+   break;
+   case BBR_PROBE_BW:
+   bbr->pacing_gain = (bbr->lt_use_bw ?
+   BBR_UNIT :
+   bbr_pacing_gain[bbr->cycle_idx]);
+   bbr->cwnd_gain   = bbr_cwnd_gain;
+   break;
+   case BBR_PROBE_RTT:
+   bbr->pacing_gain = BBR_UNIT;
+   bbr->cwnd_gain   = BBR_UNIT;
+   break;
+   default:
+   WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode);
+   break;
+   }
+}
+
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
 {
bbr_update_bw(sk, rs);
@@ -867,6 +886,7 @@ static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
bbr_check_full_bw_reached(sk, rs);
bbr_check_drain(sk, rs);
bbr_update_min_rtt(sk, rs);
+   bbr_update_gains(sk);
 }
 
 static void bbr_main(struct sock *sk, const struct rate_sample *rs)
-- 
2.19.1.331.ge82ca0e54c-goog



[PATCH net-next 0/2] tcp_bbr: TCP BBR changes for EDT pacing model

2018-10-16 Thread Neal Cardwell
Two small patches for TCP BBR to follow up with Eric's recent work to change
the TCP and fq pacing machinery to an "earliest departure time" (EDT) model:

- The first patch adjusts the TCP BBR logic to work with the new
  "earliest departure time" (EDT) pacing model.

- The second patch adjusts the TCP BBR logic to centralize the setting
  of gain values, to simplify the code and prepare for future changes.

Neal Cardwell (2):
  tcp_bbr: adjust TCP BBR for departure time pacing
  tcp_bbr: centralize code to set gains

 net/ipv4/tcp_bbr.c | 77 ++
 1 file changed, 65 insertions(+), 12 deletions(-)

-- 
2.19.1.331.ge82ca0e54c-goog



[PATCH net-next 1/2] tcp_bbr: adjust TCP BBR for departure time pacing

2018-10-16 Thread Neal Cardwell
Adjust TCP BBR for the new departure time pacing model in the recent
commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest
departure time model").

With TSQ and pacing at lower layers, there are often several skbs
queued in the pacing layer, and thus there is less data "in the
network" than "in flight".

With departure time pacing at lower layers (e.g. fq or potential
future NICs), the data in the pacing layer now has a pre-scheduled
("baked-in") departure time that cannot be changed, even if the
congestion control algorithm decides to use a new pacing rate.

This means that there can be a non-trivial lag between when BBR makes
a pacing rate change and when the inter-skb pacing delays
change. After a pacing rate change, the number of packets in the
network can gradually evolve to be higher or lower, depending on
whether the sending rate is higher or lower than the delivery
rate. Thus ignoring this lag can cause significant overshoot, with the
flow ending up with too many or too few packets in the network.

This commit changes BBR to adapt its pacing rate based on the amount
of data in the network that it estimates has already been "baked in"
by previous departure time decisions. We estimate the number of our
packets that will be in the network at the earliest departure time
(EDT) for the next skb scheduled as:

   in_network_at_edt = inflight_at_edt - (EDT - now) * bw

If we're increasing the amount of data in the network ("in_network"),
then we want to know if the transmit of the EDT skb will push
in_network above the target, so our answer includes
bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
in_network, then we want to know if in_network will sink too low just
before the EDT transmit, so our answer does not include the segments
from the skb departing at EDT.

Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case
differently? The in_network curve is a step function: in_network goes
up on transmits, and down on ACKs. To accurately predict when
in_network will go beyond our target value, this will happen on
different events, depending on whether we're concerned about
in_network potentially going too high or too low:

 o if pushing in_network up (pacing_gain > 1.0),
   then in_network goes above target upon a transmit event

 o if pushing in_network down (pacing_gain < 1.0),
   then in_network goes below target upon an ACK event

This commit changes the BBR state machine to use this estimated
"packets in network" value to make its decisions.

Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 37 +++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index b88081285fd17..4cc2223d2cd54 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
return cwnd;
 }
 
+/* With pacing at lower layers, there's often less data "in the network" than
+ * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq),
+ * we often have several skbs queued in the pacing layer with a pre-scheduled
+ * earliest departure time (EDT). BBR adapts its pacing rate based on the
+ * inflight level that it estimates has already been "baked in" by previous
+ * departure time decisions. We calculate a rough estimate of the number of our
+ * packets that might be in the network at the earliest departure time for the
+ * next skb scheduled:
+ *   in_network_at_edt = inflight_at_edt - (EDT - now) * bw
+ * If we're increasing inflight, then we want to know if the transmit of the
+ * EDT skb will push inflight above the target, so inflight_at_edt includes
+ * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight,
+ * then estimate if inflight will sink too low just before the EDT transmit.
+ */
+static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct bbr *bbr = inet_csk_ca(sk);
+   u64 now_ns, edt_ns, interval_us;
+   u32 interval_delivered, inflight_at_edt;
+
+   now_ns = tp->tcp_clock_cache;
+   edt_ns = max(tp->tcp_wstamp_ns, now_ns);
+   interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC);
+   interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE;
+   inflight_at_edt = inflight_now;
+   if (bbr->pacing_gain > BBR_UNIT)  /* increasing inflight */
+   inflight_at_edt += bbr_tso_segs_goal(sk);  /* include EDT skb */
+   if (interval_delivered >= inflight_at_edt)
+   return 0;
+   return inflight_at_edt - interval_delivered;
+}
+
 /* An optimization in BBR to reduce losses: On the first round of recovery, we
  * follow the packet 

Re: Why not use all the syn queues? in the function "tcp_conn_request", I have some questions.

2018-09-08 Thread Neal Cardwell
On Sat, Sep 8, 2018 at 11:23 AM Ttttabcd  wrote:
>
> Thank you very much for your previous answer, sorry for the inconvenience.
>
> But now I want to ask you one more question.
>
> The question is why we need two variables to control the syn queue?
>
> The first is the "backlog" parameter of the "listen" system call that 
> controls the maximum length limit of the syn queue, it also controls the 
> accept queue.

By default, and essentially always in practice (AFAIK), Linux
installations enable syncookies. With syncookies, there is essentially
no limit on the syn queue, or number of incomplete passive connections
(as the man page you quoted notes). So in practice the listen()
parameter usually controls only the accept queue.

> The second is /proc/sys/net/ipv4/tcp_max_syn_backlog, which also controls the 
> maximum length limit of the syn queue.
>
> So simply changing one of them and wanting to increase the syn queue is not 
> working.
>
> In our last discussion, I understood tcp_max_syn_backlog will retain the last 
> quarter to the IP that has been proven to be alive

That discussion pertains to a code path that is relevant if syncookies
are disabled, which is very uncommon (see above).

> But if tcp_max_syn_backlog is very large, the syn queue will be filled as 
> well.
>
> So I don't understand why not just use a variable to control the syn queue.
>
> For example, just use tcp_max_syn_backlog, which is the maximum length limit 
> for the syn queue, and it can also be retained to prove that the IP remains 
> the last quarter.
>
> The backlog parameter of the listen system call only controls the accpet 
> queue.
>
> I feel this is more reasonable. If I don't look at the source code, I really 
> can't guess the backlog parameter actually controls the syn queue.
>
> I always thought that it only controlled the accept queue before I looked at 
> the source code, because the man page is written like this.

Keep in mind that the semantics of the listen() argument and the
/proc/sys/net/ipv4/tcp_max_syn_backlog sysctl knob, as described in
the man page, are part of the Linux kernel's user-visible API. So, in
essence, they cannot be changed. Changing the semantics of system
calls and sysctl knobs breaks applications and system configuration
scripts. :-)

neal


Re: Why not use all the syn queues? in the function "tcp_conn_request", I have some questions.

2018-09-04 Thread Neal Cardwell
On Tue, Sep 4, 2018 at 1:48 AM Ttttabcd  wrote:
>
> Hello everyone,recently I am looking at the source code for handling TCP 
> three-way handshake(Linux Kernel version 4.18.5).
>
> I found some strange places in the source code for handling syn messages.
>
> in the function "tcp_conn_request"
>
> This code will be executed when we don't enable the syn cookies.
>
> if (!net->ipv4.sysctl_tcp_syncookies &&
> (net->ipv4.sysctl_max_syn_backlog - 
> inet_csk_reqsk_queue_len(sk) <
>  (net->ipv4.sysctl_max_syn_backlog >> 2)) &&
> !tcp_peer_is_proven(req, dst)) {
> /* Without syncookies last quarter of
>  * backlog is filled with destinations,
>  * proven to be alive.
>  * It means that we continue to communicate
>  * to destinations, already remembered
>  * to the moment of synflood.
>  */
> pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
> rsk_ops->family);
> goto drop_and_release;
> }
>
> But why don't we use all the syn queues?

If tcp_peer_is_proven() returns true then we do allow ourselves to use
the whole queue.

> Why do we need to leave the size of (net->ipv4.sysctl_max_syn_backlog >> 2) 
> in the queue?
>
> Even if the system is attacked by a syn flood, there is no need to leave a 
> part. Why do we need to leave a part?

The comment describes the rationale. If syncookies are disabled, then
the last quarter of the backlog is reserved for filling with
destinations that were proven to be alive, according to
tcp_peer_is_proven() (which uses RTTs measured in previous
connections). The idea is that if there is a SYN flood, we do not want
to use all of our queue budget on attack traffic but instead want to
reserve some queue space for SYNs from real remote machines that we
have actually contacted in the past.

> The value of sysctl_max_syn_backlog is the maximum length of the queue only 
> if syn cookies are enabled.

Even if syncookies are disabled, sysctl_max_syn_backlog is the maximum
length of the queue.

> This is the first strange place, here is another strange place
>
> __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
>
> if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
>  inet_csk_reqsk_queue_is_full(sk)) && !isn) {
>
> if (!want_cookie && !isn) {
>
> The value of "isn" comes from TCP_SKB_CB(skb)->tcp_tw_isn, then it is judged 
> twice whether its value is indeed 0.
>
> But "tcp_tw_isn" is initialized in the function "tcp_v4_fill_cb"
>
> TCP_SKB_CB(skb)->tcp_tw_isn = 0;
>
> So it has always been 0, I used printk to test, and the result is always 0.

That field is also set in tcp_timewait_state_process():

  TCP_SKB_CB(skb)->tcp_tw_isn = isn;

So there can be cases where it is not 0.

Hope that helps,
neal


[PATCH net] tcp_bbr: fix bw probing to raise in-flight data for very small BDPs

2018-07-27 Thread Neal Cardwell
For some very small BDPs (with just a few packets) there was a
quantization effect where the target number of packets in flight
during the super-unity-gain (1.25x) phase of gain cycling was
implicitly truncated to a number of packets no larger than the normal
unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow
scenarios some flows could get stuck with a lower bandwidth, because
they did not push enough packets inflight to discover that there was
more bandwidth available. This was really only an issue in multi-flow
LAN scenarios, where RTTs and BDPs are low enough for this to be an
issue.

This fix ensures that gain cycling can raise inflight for small BDPs
by ensuring that in PROBE_BW mode target inflight values with a
super-unity gain are always greater than inflight values with a gain
<= 1. Importantly, this applies whether the inflight value is
calculated for use as a cwnd value, or as a target inflight value for
the end of the super-unity phase in bbr_is_next_cycle_phase() (both
need to be bigger to ensure we can probe with more packets in flight
reliably).

This is a candidate fix for stable releases.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell 
Acked-by: Yuchung Cheng 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Priyaranjan Jha 
Reviewed-by: Eric Dumazet 
---
 net/ipv4/tcp_bbr.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 3b5f45b9e81eb..13d34427ca3dd 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -358,6 +358,10 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
/* Reduce delayed ACKs by rounding up cwnd to the next even number. */
cwnd = (cwnd + 1) & ~1U;
 
+   /* Ensure gain cycling gets inflight above BDP even for small BDPs. */
+   if (bbr->mode == BBR_PROBE_BW && gain > BBR_UNIT)
+   cwnd += 2;
+
return cwnd;
 }
 
-- 
2.18.0.345.g5c9ce644c3-goog



Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-24 Thread Neal Cardwell
On Tue, Jul 24, 2018 at 1:42 PM Lawrence Brakmo  wrote:
>
> Note that without this fix the 99% latencies when doing 10KB RPCs
> in a congested network using DCTCP are 40ms vs. 190us with the patch.
> Also note that these 40ms high tail latencies started after commit
> 3759824da87b30ce7a35b4873b62b0ba38905ef5 in Jul 2015,
> which triggered the bugs/features we are fixing/adding. I agree it is
> debatable whether it is a bug fix or a feature improvement and I am
> fine either way.

Good point. The fact that this greatly mitigates a regression in DCTCP
performance resulting from 3759824da87b30ce7a35b4873b62b0ba38905ef5
("tcp: PRR uses CRB mode by default and SS mode conditionally") IMHO
seems to be a good argument for putting this patch ("tcp: ack
immediately when a cwr packet arrives") in the "net" branch and stable
releases.

thanks,
neal


Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-24 Thread Neal Cardwell
On Tue, Jul 24, 2018 at 1:07 PM Yuchung Cheng  wrote:
>
> On Mon, Jul 23, 2018 at 7:23 PM, Daniel Borkmann  wrote:
> > Should this go to net tree instead where all the other fixes went?
> I am neutral but this feels more like a feature improvement

I agree this feels like a feature improvement rather than a bug fix.

neal


Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-23 Thread Neal Cardwell
On Mon, Jul 23, 2018 at 8:49 PM Lawrence Brakmo  wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives, the ACK was sometimes delayed even though it
> is CWR marked, adding up to 40ms to the RPC latency.
>
> This patch ensures that CWR marked data packets arriving will be acked
> immediately.
...
> Modified based on comments by Neal Cardwell 
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_input.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)

Seems like a nice mechanism to have, IMHO.

Acked-by: Neal Cardwell 

Thanks!
neal


Re: [PATCH net-next] tcp: expose both send and receive intervals for rate sample

2018-07-09 Thread Neal Cardwell
On Mon, Jul 9, 2018 at 1:58 PM Deepti Raghavan  wrote:
>
> Congestion control algorithms, which access the rate sample
> through the tcp_cong_control function, only have access to the maximum
> of the send and receive interval, for cases where the acknowledgment
> rate may be inaccurate due to ACK compression or decimation. Algorithms
> may want to use send rates and receive rates as separate signals.
>
> Signed-off-by: Deepti Raghavan 
> ---
>  include/net/tcp.h   | 2 ++
>  net/ipv4/tcp_rate.c | 4 
>  2 files changed, 6 insertions(+)
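
For example, a congestion control module could then read the two
intervals separately, roughly as in the sketch below (it assumes the
snd_interval_us / rcv_interval_us fields added by this patch; the module
and function names are made up):

#include <linux/math64.h>

static void my_cc_cong_control(struct sock *sk, const struct rate_sample *rs)
{
	u64 snd_bw = 0, rcv_bw = 0;	/* packets per second */

	if (rs->delivered > 0 && rs->snd_interval_us > 0)
		snd_bw = div_u64((u64)rs->delivered * USEC_PER_SEC,
				 rs->snd_interval_us);
	if (rs->delivered > 0 && rs->rcv_interval_us > 0)
		rcv_bw = div_u64((u64)rs->delivered * USEC_PER_SEC,
				 rs->rcv_interval_us);

	/* A receive-side rate far above the send-side rate hints at ACK
	 * compression/decimation; an algorithm can then prefer snd_bw.
	 */
}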

Thanks for re-sending. It does seem to be showing up in patchwork now:
  https://patchwork.ozlabs.org/patch/941532/
And I can confirm I'm able to apply it to net-next.

Acked-by: Neal Cardwell 

thanks,
neal


Re: [PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP

2018-07-07 Thread Neal Cardwell
On Sat, Jul 7, 2018 at 7:15 AM David Miller  wrote:
>
> From: Lawrence Brakmo 
> Date: Tue, 3 Jul 2018 09:26:13 -0700
>
> > We have observed high tail latencies when using DCTCP for RPCs as
> > compared to using Cubic. For example, in one setup there are 2 hosts
> > sending to a 3rd one, with each sender having 3 flows (1 stream,
> > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
> >
> >              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
> >  1MB RPCs    2.6ms      5.5ms        43ms       208ms
> > 10KB RPCs    1.1ms      1.3ms        53ms       212ms
>  ...
> > v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in
> > tcp_event_ack_sent. Based on Neal Cardwell 
> > feedback.
> > Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead of 
> > modifying
> > tcp_ack_send_check to insure an ACK when cwr is received.
> > v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check.
> >
> > [PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent
> > [PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet
>
> Neal and co., what are your thoughts right now about this patch series?
>
> Thank you.

IMHO these patches are a definite improvement over what we have now.

That said, in chatting with Yuchung before the July 4th break, I think
Yuchung and I agreed that we would ideally like to see something like
the following:

(1) refactor the DCTCP code to check for pending delayed ACKs directly
using existing state (inet_csk(sk)->icsk_ack.pending &
ICSK_ACK_TIMER), and remove the ca->delayed_ack_reserved DCTCP field
and the CA_EVENT_DELAYED_ACK and CA_EVENT_NON_DELAYED_ACK callbacks
added for DCTCP (which Larry determined had at least one bug).

(2) fix the bug with the DCTCP call to tcp_send_ack(sk) causing
delayed ACKs to be incorrectly dropped/forgotten (not yet addressed by
this patch series)

(3) then with fixes (1) and (2) in place, re-run tests and see if we
still need Larry's heuristic (in patch 2) to fire an ACK immediately
if a receiver receives a CWR packet (I suspect this is still very
useful, but I think Yuchung is reluctant to add this complexity unless
we have verified it's still needed after (1) and (2))

Our team may be able to help out with some proposed patches for (1) and (2).

In any case, I would love to have Yuchung and Eric weigh in (perhaps
Monday) before we merge this patch series.

Thanks,
neal


Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP

2018-07-03 Thread Neal Cardwell
On Tue, Jul 3, 2018 at 11:10 AM Lawrence Brakmo  wrote:
>
> On 7/2/18, 5:52 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
>  wrote:
>
> On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo  wrote:
> >
> > We have observed high tail latencies when using DCTCP for RPCs as
> > compared to using Cubic. For example, in one setup there are 2 hosts
> > sending to a 3rd one, with each sender having 3 flows (1 stream,
> > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
> >
> >              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
> >  1MB RPCs    2.6ms      5.5ms        43ms       208ms
> > 10KB RPCs    1.1ms      1.3ms        53ms       212ms
> >
> > Looking at tcpdump traces showed that there are two causes for the
> > latency.
> >
> >   1) RTOs caused by the receiver sending a dup ACK and not ACKing
> >  the last (and only) packet sent.
> >   2) Delaying ACKs when the sender has a cwnd of 1, so everything
> >  pauses for the duration of the delayed ACK.
> >
> > The first patch fixes the cause of the dup ACKs, not updating DCTCP
> > state when an ACK that was initially delayed has been sent with a
> > data packet.
> >
> > The second patch ensures that an ACK is sent immediately when a
> > CWR marked packet arrives.
> >
> > With the patches the latencies for DCTCP now look like:
> >
> >              dctcp 99%  dctcp 99.9%
> >  1MB RPCs    5.8ms      6.9ms
> > 10KB RPCs    146us      203us
> >
> > Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> > the 10KB latencies are much smaller than Cubic's. These patches fix
> > issues on the receiver, but tcpdump traces indicate there is an
> > opportunity to also fix an issue at the sender that adds about 3ms
> > to the tail latencies.
> >
> > The following trace shows the issue that triggers an RTO (fixed by these patches):
> >
> >Host A sends the last packets of the request
> >Host B receives them, and the last packet is marked with congestion (CE)
> >Host B sends ACKs for packets not marked with congestion
> >Host B sends data packet with reply and ACK for packet marked with
> >   congestion (TCP flag ECE)
> >Host A receives ACKs with no ECE flag
> >Host A receives data packet with ACK for the last packet of request
> >   and which has TCP ECE bit set
> >Host A sends 1st data packet of the next request with TCP flag CWR
> >Host B receives the packet (as seen in tcpdump at B), no CE flag
> >Host B sends a dup ACK that also has the TCP ECE flag
> >Host A RTO timer fires!
> >Host A to send the next packet
> >Host A receives an ACK for everything it has sent (i.e. Host B
> >   did receive 1st packet of request)
> >Host A send more packets…
> >
> > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
> > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
> >
> >  net/ipv4/tcp_input.c  | 16 +++-
> >  net/ipv4/tcp_output.c |  4 ++--
> >  2 files changed, 13 insertions(+), 7 deletions(-)
>
> Thanks, Larry. Just for context, can you please let us know whether
> your tests included zero, one, or both of Eric's recent commits
> (listed below) that tuned the number of ACKs after ECN events? (Or
> maybe the tests were literally using a net-next kernel?) Just wanted
> to get a better handle on any possible interactions there.
>
> Thanks!
>
> neal
>
> Yes, my test kernel includes both patches listed below.

OK, great.

> BTW, I will send a new
> patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to 
> tcp_ecn_accept_cwr.

OK, sounds good to me.

thanks,
neal


Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Neal Cardwell
On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng  wrote:
>
> On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
> >
> > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> > notifications to keep track if it needs to send an ACK for packets that
> > were received with a particular ECN state but whose ACK was delayed.
> >
> > Under some circumstances, for example when a delayed ACK is sent with a
> > data packet, DCTCP state was not being updated due to a lack of
> > notification that the previously delayed ACK was sent. As a result, it
> > would sometimes send a duplicate ACK when a new data packet arrived.
> >
> > This patch ensures that DCTCP's state is correctly updated so it will
> > not send the duplicate ACK.
> Sorry to chime-in late here (lame excuse: IETF deadline)
>
> IIRC this issue would exist prior to 4.11 kernel. While it'd be good
> to fix that, it's not clear which patch introduced the regression
> between 4.11 and 4.16? I assume you tested Eric's most recent quickack
> fix.

I don't think Larry is saying there's a regression between 4.11 and
4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
patch here:

  https://patchwork.ozlabs.org/patch/935233/

said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
was still bad but with different netstat counters (no RTOs but still
high tail latency):

"""
On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there.
"""

I suspect the RTOs disappeared but latencies remained too high because
between 4.11 and 4.16 we introduced:
  tcp: allow TLP in ECN CWR
  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd

So the RTOs probably disappeared because this commit turned them into
TLPs. But the latencies remained high because the fundamental bug
remained throughout 4.11 and 4.16 and today: the DCTCP use of
tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
when they should not have been.

> In terms of the fix itself, it seems odd the tcp_send_ack() call in
> DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
> delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the
> "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
> cancelled, because that prior-ACK really is a supplementary old ACK.

This patch is fixing an issue that's orthogonal to the one you are
talking about. Using the taxonomy from our team's internal discussion
yesterday, the issue you mention where the DCTCP "prior" ACK is
cancelling delayed ACKs is issue (4); the issue that this particular
"tcp: notify when a delayed ack is sent" patch from Larry fixes is
issue (3). It's a bit tricky because both issues appeared in Larry's
trace summary and packetdrill script to reproduce the issue.

neal


Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP

2018-07-02 Thread Neal Cardwell
On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>  1MB RPCs    2.6ms      5.5ms        43ms       208ms
> 10KB RPCs    1.1ms      1.3ms        53ms       212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>              dctcp 99%  dctcp 99.9%
>  1MB RPCs    5.8ms      6.9ms
> 10KB RPCs    146us      203us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion (CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A to send the next packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A send more packets…
>
> [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
>
>  net/ipv4/tcp_input.c  | 16 +++-
>  net/ipv4/tcp_output.c |  4 ++--
>  2 files changed, 13 insertions(+), 7 deletions(-)

Thanks, Larry. Just for context, can you please let us know whether
your tests included zero, one, or both of Eric's recent commits
(listed below) that tuned the number of ACKs after ECN events? (Or
maybe the tests were literally using a net-next kernel?) Just wanted
to get a better handle on any possible interactions there.

Thanks!

neal

---
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
Author: Eric Dumazet 
Date:   Mon May 21 15:08:57 2018 -0700

tcp: do not aggressively quick ack after ECN events

ECN signals currently forces TCP to enter quickack mode for
up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

We believe this is not needed, and only sending one immediate ack
for the current packet should be enough.

This should reduce the extra load noticed in DCTCP environments,
after congestion events.

This is part 2 of our effort to reduce pure ACK packets.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=15ecbe94a45ef88491ca459b26efdd02f91edb6d
commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d
Author: Eric Dumazet 
Date:   Wed Jun 27 08:47:21 2018 -0700

tcp: add one more quick ack after after ECN events

Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
about our recent patch removing ~16 quick acks after ECN events.

tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
but in the case the sender cwnd was lowered to 1, we do not want
    to have a delayed ack for the next packet we will receive.

Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
Signed-off-by: Eric Dumazet 
Reported-by: Neal Cardwell 
Cc: Lawrence Brakmo 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 


Re: [PATCH net-next 1/2] tcp: notify when a delayed ack is sent

2018-07-02 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> notifications to keep track if it needs to send an ACK for packets that
> were received with a particular ECN state but whose ACK was delayed.
>
> Under some circumstances, for example when a delayed ACK is sent with a
> data packet, DCTCP state was not being updated due to a lack of
> notification that the previously delayed ACK was sent. As a result, it
> would sometimes send a duplicate ACK when a new data packet arrived.
>
> This patch ensures that DCTCP's state is correctly updated so it will
> not send the duplicate ACK.
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_output.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8f6129160dd..41f6ad7a21e4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
> __sock_put(sk);
> }
> tcp_dec_quickack_mode(sk, pkts);
> +   if (inet_csk_ack_scheduled(sk))
> +   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>  }

Thanks for this fix! Seems like this would work, but if I am reading
this correctly then it seems like this would cause a duplicate call to
tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) when we are sending a pure
ACK (delayed or non-delayed):

(1) once from tcp_send_ack() before we send the ACK:

tcp_send_ack(struct sock *sk)
 ->tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);

(2) then again from tcp_event_ack_sent() after we have sent the ACK:

tcp_event_ack_sent()
->   if (inet_csk_ack_scheduled(sk))
 tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);

What if we remove the original CA_EVENT_NON_DELAYED_ACK call and just
replace it with your new one? (not compiled, not tested):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3889dcd4868d4..bddb49617d9be 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -184,6 +184,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
__sock_put(sk);
}
tcp_dec_quickack_mode(sk, pkts);
+   if (inet_csk_ack_scheduled(sk))
+   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }

@@ -3836,8 +3838,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;

-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.

Aside from lower CPU overhead, one nice benefit of that is that we
then only call tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) in one
place, which might be a little easier to reason about.

Does that work?

neal


Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-07-02 Thread Neal Cardwell
On Sat, Jun 30, 2018 at 9:47 PM Lawrence Brakmo  wrote:
> I see two issues. One is that entering quickack mode as you
> mentioned does not ensure that it will still be on when the CWR
> arrives. The second issue is that the problem occurs right after the
> receiver sends a small reply, which results in entering pingpong mode
> right before the sender starts the new request by sending just one
> packet (i.e. forces delayed ack).
>
> I compiled and tested your patch. Both 99 and 99.9 percentile
> latencies are around 40ms. Looking at the packet traces shows that
> some CWR marked packets are not being ack immediately (delayed by
> 40ms).

Thanks, Larry! So your tests provide nice, specific evidence that it
is good to force an immediate ACK when a receiver receives a packet
with CWR marked. Given that, I am wondering what the simplest way is
to achieve that goal.

What if, rather than plumbing a new specific signal into
__tcp_ack_snd_check(), we use the existing general quick-ack
mechanism, where various parts of the TCP stack (like
__tcp_ecn_check_ce())  are already using the quick-ack mechanism to
"remotely" signal to __tcp_ack_snd_check() that they want an immediate
ACK.

For example, would it work to do something like:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c53ae5fc834a5..8168d1938b376 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -262,6 +262,12 @@ static void __tcp_ecn_check_ce(struct sock *sk,
const struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);

+   /* If the sender is telling us it has entered CWR, then its cwnd may be
+* very low (even just 1 packet), so we should ACK immediately.
+*/
+   if (tcp_hdr(skb)->cwr)
+   tcp_enter_quickack_mode(sk, 2);
+
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
case INET_ECN_NOT_ECT:
/* Insure that GCN will not continue to mark packets. */

And then since that broadens the mission of this function beyond
checking just the ECT/CE bits, I supposed we could rename the
__tcp_ecn_check_ce() and tcp_ecn_check_ce() functions to
__tcp_ecn_check() and tcp_ecn_check(), or something like that.

Would that work for this particular issue?

neal


Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 3:52 PM Ilpo Järvinen  wrote:
> > +.005 < . 1:1(0) ack 2001 win 257
>
> Why did the receiver send a cumulative ACK only for 2001?

Sorry, you are right Ilpo. Upon further reflection, the packetdrill
scenario I posted is not a realistic one, and I agree we should not
worry about it.

As I mentioned, I ran your patch through all our team's TCP
packetdrill tests, and it passes all of the tests. One of our tests
needed updating, because if there is a non-SACK connection with a
spurious RTO due to a delayed flight of ACKs then the FRTO undo now
happens one ACK later (when we get an ACK that doesn't cover a
retransmit). But that seems fine to me.

I also cooked the new packetdrill test below to explicitly cover this
case you are addressing (please let me know if you have an alternate
suggestion).

Tested-by: Neal Cardwell 
Acked-by: Neal Cardwell 

Thanks!
neal

---

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 
   +0 > S. 0:0(0) ack 1 
 +.02 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 3 packets. First is really lost. And the dupacks
// for the data packets that arrived at the receiver are slow in arriving.
   +0 write(4, ..., 3000) = 3000
   +0 > P. 1:3001(3000) ack 1

// RTO and retransmit head. This fills a real loss.
 +.22 > . 1:1001(1000) ack 1

// Dupacks for packets 2 and 3 arrive.
+.02  < . 1:1(0) ack 1 win 257
   +0 < . 1:1(0) ack 1 win 257

// The cumulative ACK for all the data arrives. We do not undo, because
// this is a non-SACK connection, and retransmitted data was ACKed.
// It's good that there's no FRTO undo, since a packet was really lost.
// Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss
// until something beyond high_seq is ACKed.
+.005 < . 1:1(0) ack 3001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%


Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>               Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>   1MB RPCs    2.6ms      5.5ms        43ms       208ms
>  10KB RPCs    1.1ms      1.3ms        53ms       212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>               dctcp 99%  dctcp 99.9%
>   1MB RPCs    4.8ms      6.5ms
>  10KB RPCs    143us      184us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion (CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A to send the next packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A send more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry!

I suspect there is a broader problem with "DCTCP-style precise ECE
ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They
violate those assumptions by rewinding tp->rcv_nxt backwards and then
calling tcp_send_ack(sk). But the existing code path to send an ACK
assumes that any ACK transmission always sends an ACK that accounts
for all received data, so that after sending an ACK we can cancel the
delayed ACK timer. So it seems with DCTCP we can have buggy sequences
where the DCTCP historical ACK causes us to forget that we need to
schedule a delayed ACK (which will force the sender to RTO, as shown
in the trace):

tcp_event_data_recv()
 inet_csk_schedule_ack()  // remember that we need an ACK
 tcp_ecn_check_ce()
  tcp_ca_event() // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
dctcp_cwnd_event()
  dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
   if (... && ca->delayed_ack_reserved)
  tp->rcv_nxt = ca->prior_rcv_nxt;
  tcp_send_ack(sk); // send an ACK, but for old data!
tcp_transmit_skb()
  tcp_event_ack_sent()
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
// forget that we need a delayed ACK!
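
For reference, the rewind logic in question looks roughly like this (a
paraphrase of the dctcp_ce_state_0_to_1() code of this era, not the
verbatim source):

    /* Paraphrased sketch, not verbatim kernel code: when the CE state
     * flips while a delayed ACK is reserved, DCTCP temporarily rewinds
     * rcv_nxt so that a "historical" ACK reports the old CE state.
     */
    if (!ca->ce_state && ca->delayed_ack_reserved) {
            u32 tmp_rcv_nxt = tp->rcv_nxt;

            tp->rcv_nxt = ca->prior_rcv_nxt;  /* ACK only the old data */
            tcp_send_ack(sk);                 /* side effect: cancels the delayed-ACK timer */
            tp->rcv_nxt = tmp_rcv_nxt;        /* restore */
    }
    ca->prior_rcv_nxt = tp->rcv_nxt;
    ca->ce_state = 1;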

AFAICT the first patch in this series, to force an immediate ACK on
any packet with a CWR, papers over this issue of the forgotten delayed
ACK by forcing an immediate ack in __tcp_ack_snd_check(). This ensures
that at least for CWR packets the forgotten delayed ACK does not
matter. But for packets without CWR I believe the buggy "forgotten
delayed ACK" sequence above is still possible, and could still plague
DCTCP with high tail latencies in some cases. I suspect that after
your CWR-forces-ACK patch it may not be jumping out in performance
test results because (a) if there is no CWR involved then usually the
cwnd is high enough that missing delayed ACK does not cause a stall,
and (b) if there is a CWR involved then your patch forces an immediate
ACK. But AFAICT that forgotten delayed ACK bug is still there.
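
One direction for a fuller fix (a hypothetical sketch on my part, not a
submitted patch) would be to give the special DCTCP ACK path a way to
ACK an old sequence number without rewinding tp->rcv_nxt and without
running the code that cancels the delayed-ACK timer, e.g. roughly:

    /* Hypothetical sketch only: send an ACK for an arbitrary rcv_nxt
     * without touching tp->rcv_nxt or the delayed-ACK machinery.
     */
    void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);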

What do folks think?

neal


Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We observed high 99% and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives it was sometimes delayed adding up to 40ms to
> the latency.
>
> This patch ensures that CWR-marked data packets arriving will be acked
> immediately.
>
> Signed-off-by: Lawrence Brakmo 
> ---

Thanks, Larry. Ensuring that CWR-marked data packets arriving will be
acked immediately sounds like a good goal to me.

I am wondering if we can have a simpler fix.

The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR
bit in ecn_flags, which disables the code in __tcp_ecn_check_ce() that
would have otherwise used the tcp_enter_quickack_mode() mechanism to
force an ACK:

__tcp_ecn_check_ce():
...
case INET_ECN_CE:
  if (tcp_ca_needs_ecn(sk))
tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
   // -> dctcp_ce_state_0_to_1()
   // ->  tp->ecn_flags |= TCP_ECN_DEMAND_CWR;

  if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
tcp_enter_quickack_mode(sk, 1);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
  }
  tp->ecn_flags |= TCP_ECN_SEEN;
  break;

So it seems like the bug here may be that dctcp_ce_state_0_to_1()  is
setting the TCP_ECN_DEMAND_CWR  bit in ecn_flags, when really it
should let its caller, __tcp_ecn_check_ce() set TCP_ECN_DEMAND_CWR, in
which case the code would already properly force an immediate ACK.

So, what if we use this fix instead (not compiled, not tested):

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..4fecd2824edb 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)

ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 1;
-
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 }

 static void dctcp_ce_state_1_to_0(struct sock *sk)

What do you think?

neal


Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows

2018-06-29 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen  wrote:
>
> If SACK is not enabled and the first cumulative ACK after the RTO
> retransmission covers more than the retransmitted skb, a spurious
> FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> The reason is that any non-retransmitted segment acknowledged will
> set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> no indication that it would have been delivered for real (the
> scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> case so the check for that bit won't help like it does with SACK).
> Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> in tcp_process_loss.
>
> We need to use more strict condition for non-SACK case and check
> that none of the cumulatively ACKed segments were retransmitted
> to prove that progress is due to original transmissions. Only then
> keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> non-SACK case.
>
> (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
> to better indicate its purpose but to keep this change minimal, it
> will be done in another patch).
>
> Besides burstiness and congestion control violations, this problem
> can result in RTO loop: When the loss recovery is prematurely
> undoed, only new data will be transmitted (if available) and
> the next retransmission can occur only after a new RTO which in case
> of multiple losses (that are not for consecutive packets) requires
> one RTO per loss to recover.
>
> Signed-off-by: Ilpo Järvinen 
> ---
>  net/ipv4/tcp_input.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 045d930..8e5522c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> prior_fack,
>
> if (tcp_is_reno(tp)) {
> tcp_remove_reno_sacks(sk, pkts_acked);
> +
> +   /* If any of the cumulatively ACKed segments was
> +* retransmitted, non-SACK case cannot confirm that
> +* progress was due to original transmission due to
> +* lack of TCPCB_SACKED_ACKED bits even if some of
> +* the packets may have been never retransmitted.
> +*/
> +   if (flag & FLAG_RETRANS_DATA_ACKED)
> +   flag &= ~FLAG_ORIG_SACK_ACKED;
> } else {
> int delta;

Thanks, Ilpo. I ran this through our TCP packetdrill tests and only
got one failure, which detected the change you made, so that on the
first ACK in a non-SACK F-RTO case we stay in CA_Loss.

However, depending on the exact scenario we are concerned about here,
this may not be a complete solution. Consider the packetdrill test I
cooked below, which is for a scenario where we imagine a non-SACK
connection, with data packet #1 and all dupacks lost. With your patch
we don't have an F-RTO undo on the first cumulative ACK, which is a
definite improvement. However, we end up with an F-RTO undo on the
*second* cumulative ACK, because the second ACK didn't cover any
retransmitted data. That means that we end up in the second round trip
back in the initial slow-start mode with a very high cwnd and infinite
ssthresh, even though data was actually lost in the first round trip.

I'm not sure how much we care about all these cases. Perhaps we should
just disable F-RTO if the connection has no SACK enabled? These
non-SACK connections are just such a rare case at this point that I
would propose it's not worth spending too much time on this.
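
(To make that alternative concrete, the sketch would be a one-line
guard where tp->frto is armed in tcp_enter_loss() -- hypothetical, not
a submitted patch:

    /* Hypothetical sketch only: arm F-RTO only for SACK-enabled flows */
    tp->frto = net->ipv4.sysctl_tcp_frto && tcp_is_sack(tp) && ...;

where "..." stands for the conditions already on that line.)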

The following packetdrill script passes when Ilpo's patch is applied.
This documents the behavior, which I think is questionable:

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 
   +0 > S. 0:0(0) ack 1 
 +.02 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4
   +0 write(4, ..., 27000) = 27000
   +0 > . 1:10001(10000) ack 1

// Suppose 1:1001 is lost and all dupacks are lost.

// RTO and retransmit head. This fills a real hole.
 +.22 > . 1:1001(1000) ack 1

+.005 < . 1:1(0) ack 2001 win 257
   +0 > . 10001:13001(3000) ack 1
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 3, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%

+.002 < . 1:1(0) ack 10001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 18, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 2147483647, tcpi_snd_ssthresh }%
   +0 > . 13001:15001(2000) ack 1

---

neal


Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-28 Thread Neal Cardwell
On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo  wrote:
>
> I just looked at 4.18 traces and the behavior is as follows:
>
>Host A sends the last packets of the request
>
>Host B receives them, and the last packet is marked with congestion (CE)
>
>Host B sends ACKs for packets not marked with congestion
>
>Host B sends data packet with reply and ACK for packet marked with 
> congestion (TCP flag ECE)
>
>Host A receives ACKs with no ECE flag
>
>Host A receives data packet with ACK for the last packet of request and 
> has TCP ECE bit set
>
>Host A sends 1st data packet of the next request with TCP flag CWR
>
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>
>Host B sends a dup ACK that also has the TCP ECE flag
>
>Host A RTO timer fires!
>
>Host A to send the next packet
>
>Host A receives an ACK for everything it has sent (i.e. Host B did receive 
> 1st packet of request)
>
>Host A sends more packets…

Thanks, Larry! This is very interesting. I don't know the cause, but
this reminds me of an issue  Steve Ibanez raised on the netdev list
last December, where he was seeing cases with DCTCP where a CWR packet
would be received and buffered by Host B but not ACKed by Host B. This
was the thread "Re: Linux ECN Handling", starting around December 5. I
have cc-ed Steve.

I wonder if this may somehow be related to the DCTCP logic to rewind
tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
DCTCP notices that the incoming CE bits have been changed while the
receiver thinks it is holding on to a delayed ACK (in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
"synthetic" call to tcp_send_ack() somehow has side effects in the
delayed ACK state machine that can cause the connection to forget that
it still needs to fire a delayed ACK, even though it just sent an ACK
just now.

neal


Re: [PATCH net] tcp: add one more quick ack after after ECN events

2018-06-27 Thread Neal Cardwell
On Wed, Jun 27, 2018 at 11:47 AM Eric Dumazet  wrote:
>
> Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
> tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
> about our recent patch removing ~16 quick acks after ECN events.
>
> tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
> but in the case the sender cwnd was lowered to 1, we do not want
> to have a delayed ack for the next packet we will receive.
>
> Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
> Signed-off-by: Eric Dumazet 
> Reported-by: Neal Cardwell 
> Cc: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Acked-by: Neal Cardwell 

Thanks, Eric!

neal


Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-27 Thread Neal Cardwell
On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo  wrote:
> The only issue is if it is safe to always use 2 or if it is better to
> use min(2, snd_ssthresh) (which could still trigger the problem).

Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
that should be the same as just 2, since:

(a) RFCs mandate ssthresh should not be below 2, e.g.
https://tools.ietf.org/html/rfc5681 page 7:

 ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

(b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
that constraint, and always have an ssthresh of at least 2.

And if some CC misbehaves and uses a lower ssthresh, then taking
min(2, snd_ssthresh) will trigger problems, as you note.

> +   tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

AFAICT this does seem like it will make the sender behavior more
aggressive in cases with high loss and/or a very low per-flow
fair-share.

Old:

o send N packets
o receive SACKs for last 3 packets
o fast retransmit packet 1
o using ACKs, slow-start upward

New:

o send N packets
o receive SACKs for last 3 packets
o fast retransmit packets 1 and 2
o using ACKs, slow-start upward

In the extreme case, if the available fair share is less than 2
packets, whereas inflight would have oscillated between 1 packet and 2
packets with the existing code, it now seems like with this commit the
inflight will now hover at 2. It seems like this would have
significantly higher losses than we had with the existing code.

This may or may not be OK in practice, but IMHO it is worth mentioning
and discussing.

neal


Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child

2018-06-26 Thread Neal Cardwell
On Tue, Jun 26, 2018 at 11:46 AM Eric Dumazet  wrote:
>
> Signed-off-by: Eric Dumazet 
> ---
>  net/ipv4/tcp_minisocks.c | 223 ---
>  1 file changed, 113 insertions(+), 110 deletions(-)

Yes, very nice clean-up! Thanks for doing this.

Acked-by: Neal Cardwell 

neal


Re: [PATCH net] sctp: not allow to set rto_min with a value below 200 msecs

2018-05-29 Thread Neal Cardwell
On Tue, May 29, 2018 at 11:45 AM Marcelo Ricardo Leitner <
marcelo.leit...@gmail.com> wrote:
> - patch2 - fix rtx attack vector
>- Add the floor value to rto_min to HZ/20 (which fits the values
>  that Michael shared on the other email)

I would encourage allowing minimum RTO values down to 5ms, if the ACK
policy in the receiver makes this feasible. Our experience is that in
datacenter environments it can be advantageous to allow timer-based loss
recoveries using timeout values as low as 5ms, e.g.:


https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
   https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

cheers,
neal


Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode

2018-05-22 Thread Neal Cardwell
On Tue, May 22, 2018 at 8:31 PM kbuild test robot  wrote:

> Hi Eric,

> Thank you for the patch! Yet something to improve:

> [auto build test ERROR on net/master]
> [also build test ERROR on v4.17-rc6 next-20180517]
> [cannot apply to net-next/master]
> [if your patch is applied to the wrong git tree, please drop us a note to
help improve the system]

> url:
https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-add-max_quickacks-param-to-tcp_incr_quickack-and-tcp_enter_quickack_mode/20180523-075103
> config: i386-randconfig-x012-201820 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>  # save the attached .config to linux build tree
>  make ARCH=i386

> All errors (new ones prefixed by >>):

> net//ipv4/tcp_input.c: In function 'tcp_data_queue':
> >> net//ipv4/tcp_input.c:4656:2: error: too few arguments to function
'tcp_enter_quickack_mode'
>   tcp_enter_quickack_mode(sk);
>   ^~~
> net//ipv4/tcp_input.c:199:13: note: declared here
>  static void tcp_enter_quickack_mode(struct sock *sk, unsigned int
max_quickacks)
>  ^~~
...

For the record, this is an error in the tool, rather than the patch. The
tool seems to be using a stale net-next tree for building this patch.

The compile error is here in line 4656:

> ^1da177e4 Linus Torvalds   2005-04-16  4652 /* Out of window.
F.e. zero window probe. */
> ^1da177e4 Linus Torvalds   2005-04-16  4653 if
(!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
> ^1da177e4 Linus Torvalds   2005-04-16  4654 goto
out_of_window;
> ^1da177e4 Linus Torvalds   2005-04-16  4655
> 463c84b97 Arnaldo Carvalho de Melo 2005-08-09 @4656
tcp_enter_quickack_mode(sk);
> ^1da177e4 Linus Torvalds   2005-04-16  4657
> ^1da177e4 Linus Torvalds   2005-04-16  4658 if
(before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
> ^1da177e4 Linus Torvalds   2005-04-16  4659 /*
Partial packet, seq < rcv_next < end_seq */
...

But that line is not in net-next any more, after Eric's recent net-next
commit:

a3893637e1eb0e ("tcp: do not force quickack when receiving out-of-order
packets")

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/net/ipv4/tcp_input.c?id=a3893637e1eb0ef5eb1bbc52b3a8d2dfa317a35d

That commit removed that line:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0bf032839548f..f5622b2506651 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4715,8 +4715,6 @@ drop:
 if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt +
tcp_receive_window(tp)))
 goto out_of_window;

-   tcp_enter_quickack_mode(sk);
-
 if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 /* Partial packet, seq < rcv_next < end_seq */
 SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",

cheers,
neal


Re: [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events

2018-05-22 Thread Neal Cardwell
On Mon, May 21, 2018 at 6:09 PM Eric Dumazet <eduma...@google.com> wrote:

> ECN signals currently forces TCP to enter quickack mode for
> up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

> We believe this is not needed, and only sending one immediate ack
> for the current packet should be enough.

> This should reduce the extra load noticed in DCTCP environments,
> after congestion events.

> This is part 2 of our effort to reduce pure ACK packets.

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode

2018-05-22 Thread Neal Cardwell
On Mon, May 21, 2018 at 6:09 PM Eric Dumazet <eduma...@google.com> wrote:

> We want to add finer control of the number of ACK packets sent after
> ECN events.

> This patch is not changing current behavior, it only enables following
> change.

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet <eduma...@google.com> wrote:

> This per netns sysctl allows for TCP SACK compression fine-tuning.

> This limits number of SACK that can be compressed.
> Using 0 disables SACK compression.

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet <eduma...@google.com> wrote:

> This per netns sysctl allows for TCP SACK compression fine-tuning.

> Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH v3 net-next 3/6] tcp: add SACK compression

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet <eduma...@google.com> wrote:

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.

> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.

> This patch adds a high resolution timer and tp->compressed_ack counter.

> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :

>  delay = min ( 5 % of RTT, 1 ms)

> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.

> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.

> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.

> A new SNMP counter is added in the following patch.

> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Very nice. I like the constants and the min(rcv_rtt, srtt).
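
For readers skimming the archive: the delay computation being acked
here boils down to roughly the following (a paraphrase, not the
verbatim patch; the committed version reads the 1 ms cap and related
knobs from the sysctls added later in this series):

    /* delay = min(5% of RTT, 1 ms); rtt_us = min(srtt, rcv_rtt), in usec */
    delay = min_t(u64, NSEC_PER_MSEC, (u64)rtt_us * NSEC_PER_USEC / 20);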

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH net-next 3/4] tcp: add SACK compression

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 11:40 AM Eric Dumazet <eric.duma...@gmail.com>
wrote:
> On 05/17/2018 08:14 AM, Neal Cardwell wrote:
> > Is there a particular motivation for the cap of 127? IMHO 127 ACKs is
quite
> > a few to compress. Experience seems to show that it works well to have
one
> > GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It
might
> > be nice to try to match those dynamics in this SACK compression case,
so it
> > might be nice to cap the number of compressed ACKs at something like 44?
> > (0xFFFF / 1448 - 1).  That way for high-speed paths we could try to keep
> > the ACK clock going with ACKs for ~64KBytes that trigger a single TSO
skb
> > of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.

> 127 was chosen because the field is u8, and since skb allocation for the
ACK
> can fail, we could have cases were the field goes above 127.

> Ultimately, I believe a followup patch would add a sysctl, so that we can
fine-tune
> this, and eventually disable ACK compression if this sysctl is set to 0

OK, a sysctl sounds good. I would still vote for a default of 44.  :-)


> >> +   if (hrtimer_is_queued(>compressed_ack_timer))
> >> +   return;
> >> +
> >> +   /* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> >> +
> >> +   delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> >> + tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >>
3)/20);
> >
> > Any particular motivation for the 2.5ms here? It might be nice to match
the
> > existing TSO autosizing dynamics and use 1ms here instead of having a
> > separate new constant of 2.5ms. Smaller time scales here should lead to
> > less burstiness and queue pressure from data packets in the network,
and we
> > know from experience that the CPU overhead of 1ms chunks is acceptable.

> This came from my tests on wifi really :)

> I also had the idea to make this threshold adjustable for wifi, like we
did for sk_pacing_shift.

> (On wifi, we might want to increase the max delay between ACK)

> So maybe use 1ms delay, when sk_pacing_shift == 10, but increase it if
sk_pacing_shift has been lowered.

Sounds good to me.

Thanks for implementing this! Overall this patch seems nice to me.

Acked-by: Neal Cardwell <ncardw...@google.com>

BTW, I guess we should spread the word to maintainers of other major TCP
stacks that they need to be prepared for what may be a much higher degree
of compression/aggregation in the SACK stream. Linux stacks going back many
years should be fine with this, but I'm not sure about the other major OSes
(they may only allow sending one MSS per ACK-with-SACKs received).

neal


Re: [PATCH net-next 3/4] tcp: add SACK compression

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet  wrote:

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.

> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.

> This patch adds a high resolution timer and tp->compressed_ack counter.

> Instead of sending a SACK, we program this timer with a small delay,
> based on SRTT and capped to 2.5 ms : delay = min ( 5 % of SRTT, 2.5 ms)
...

Very nice. Thanks for implementing this, Eric! I was wondering if the
constants here might be worth some discussion/elaboration.

> @@ -5074,6 +5076,7 @@ static inline void tcp_data_snd_check(struct sock
*sk)
>   static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
>   {
>  struct tcp_sock *tp = tcp_sk(sk);
> +   unsigned long delay;

>  /* More than one full frame received... */
>  if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss
&&
> @@ -5085,15 +5088,31 @@ static void __tcp_ack_snd_check(struct sock *sk,
int ofo_possible)
>  (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
>   __tcp_select_window(sk) >= tp->rcv_wnd)) ||
>  /* We ACK each frame or... */
> -   tcp_in_quickack_mode(sk) ||
> -   /* We have out of order data. */
> -   (ofo_possible && !RB_EMPTY_ROOT(>out_of_order_queue))) {
> -   /* Then ack it now */
> +   tcp_in_quickack_mode(sk)) {
> +send_now:
>  tcp_send_ack(sk);
> -   } else {
> -   /* Else, send delayed ack. */
> +   return;
> +   }
> +
> +   if (!ofo_possible || RB_EMPTY_ROOT(>out_of_order_queue)) {
>  tcp_send_delayed_ack(sk);
> +   return;
>  }
> +
> +   if (!tcp_is_sack(tp) || tp->compressed_ack >= 127)
> +   goto send_now;
> +   tp->compressed_ack++;

Is there a particular motivation for the cap of 127? IMHO 127 ACKs is quite
a few to compress. Experience seems to show that it works well to have one
GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It might
be nice to try to match those dynamics in this SACK compression case, so it
might be nice to cap the number of compressed ACKs at something like 44?
(0xFFFF / 1448 - 1).  That way for high-speed paths we could try to keep
the ACK clock going with ACKs for ~64KBytes that trigger a single TSO skb
of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.
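
(Worked out, assuming the 1448-byte MSS used above: 0xFFFF = 65535
bytes is one maximal TSO/GRO skb, and

    65535 / 1448  =  45 MSS-sized segments (integer division)
    45 - 1        =  44

i.e. roughly one compressed ACK per ~64 KBytes of data.)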

> +
> +   if (hrtimer_is_queued(>compressed_ack_timer))
> +   return;
> +
> +   /* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> +
> +   delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> + tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3)/20);

Any particular motivation for the 2.5ms here? It might be nice to match the
existing TSO autosizing dynamics and use 1ms here instead of having a
separate new constant of 2.5ms. Smaller time scales here should lead to
less burstiness and queue pressure from data packets in the network, and we
know from experience that the CPU overhead of 1ms chunks is acceptable.

thanks,
neal


Re: [PATCH net-next 2/4] tcp: do not force quickack when receiving out-of-order packets

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet <eduma...@google.com> wrote:

> As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
> acks in slow start"), TCP stacks have to consider how many packets
> are acknowledged in one single ACK, because of GRO, but also
> because of ACK compression or losses.

> We plan to add SACK compression in the following patch, we
> must therefore not call tcp_enter_quickack_mode()

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH net-next 4/4] tcp: add TCPAckCompressed SNMP counter

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet <eduma...@google.com> wrote:

> This counter tracks number of ACK packets that the host has not sent,
> thanks to ACK compression.

> Sample output :

> $ nstat -n;sleep 1;nstat|egrep
"IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
> IpInReceives                    123250             0.0
> IpOutRequests                   3684               0.0
> TcpInSegs                       123251             0.0
> TcpOutSegs                      3684               0.0
> TcpExtTCPAckCompressed          119252             0.0

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH net-next 1/4] tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet <eduma...@google.com> wrote:

> Socket can not disappear under us.

> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH net] tcp: purge write queue in tcp_connect_init()

2018-05-15 Thread Neal Cardwell
On Tue, May 15, 2018 at 12:14 AM Eric Dumazet <eduma...@google.com> wrote:

> syzkaller found a reliable way to crash the host, hitting a BUG()
> in __tcp_retransmit_skb()

> Malicous MSG_FASTOPEN is the root cause. We need to purge write queue
> in tcp_connect_init() at the point we init snd_una/write_seq.

> This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()

> kernel BUG at net/ipv4/tcp_output.c:2837!
...

> Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Yuchung Cheng <ych...@google.com>
> Cc: Neal Cardwell <ncardw...@google.com>
> Reported-by: syzbot <syzkal...@googlegroups.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net] tcp: restore autocorking

2018-05-03 Thread Neal Cardwell
On Wed, May 2, 2018 at 11:25 PM Eric Dumazet <eduma...@google.com> wrote:

> When adding rb-tree for TCP retransmit queue, we inadvertently broke
> TCP autocorking.

> tcp_should_autocork() should really check if the rtx queue is not empty.

...

> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Michael Wenig <mwe...@vmware.com>
> Tested-by: Michael Wenig <mwe...@vmware.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Nice. Thanks, Eric!

neal


[PATCH net] tcp_bbr: fix to zero idle_restart only upon S/ACKed data

2018-05-01 Thread Neal Cardwell
Previously the bbr->idle_restart tracking was zeroing out the
bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
e.g. receiving incoming data or receiver window updates. In such
situations BBR would forget that this was a restart-from-idle
situation, and if the min_rtt had expired it would unnecessarily enter
PROBE_RTT (even though we were actually restarting from idle but had
merely forgotten that fact).

The fix is simple: we need to remember we are restarting from idle
until we receive a S/ACK for some data (a S/ACK for the first flight
of data we send as we are restarting).

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Priyaranjan Jha <priyar...@google.com>
Signed-off-by: Yousuk Seung <ysse...@google.com>
---
 net/ipv4/tcp_bbr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 158d105e76da1..58e2f479ffb4d 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -806,7 +806,9 @@ static void bbr_update_min_rtt(struct sock *sk, const 
struct rate_sample *rs)
}
}
}
-   bbr->idle_restart = 0;
+   /* Restart after idle ends only once we process a new S/ACK for data */
+   if (rs->delivered > 0)
+   bbr->idle_restart = 0;
 }
 
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH net-next] tcp: remove mss check in tcp_select_initial_window()

2018-04-26 Thread Neal Cardwell
On Thu, Apr 26, 2018 at 12:58 PM Wei Wang <wei...@google.com> wrote:

> From: Wei Wang <wei...@google.com>

> In tcp_select_initial_window(), we only set rcv_wnd to
> tcp_default_init_rwnd() if current mss > (1 << wscale). Otherwise,
> rcv_wnd is kept at the full receive space of the socket which is a
> value way larger than tcp_default_init_rwnd().
> With larger initial rcv_wnd value, receive buffer autotuning logic
> takes longer to kick in and increase the receive buffer.

> In a TCP throughput test where receiver has rmem[2] set to 125MB
> (wscale is 11), we see the connection gets recvbuf limited at the
> beginning of the connection and gets less throughput overall.

> Signed-off-by: Wei Wang <wei...@google.com>
> Acked-by: Eric Dumazet <eduma...@google.com>
> Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
> Acked-by: Yuchung Cheng <ych...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Wei!

neal


packetdrill 2.0 release

2018-04-24 Thread Neal Cardwell
Hi All,

We're happy to announce the 2.0 release of the Google version of the
packetdrill network testing tool.

The code may be found at the packetdrill-v2.0 tag in the Google packetdrill
github repo:
   https://github.com/google/packetdrill

The commit is here:

https://github.com/google/packetdrill/commit/9a0ade62b7c8e3a19854b5855178dc3bb9d7f453

The 2.0 commit message, summarizing features and contributors, is included
below for a quick overview.

cheers,
neal

---
net-test: packetdrill: merge Google packetdrill changes through April 2018

This commit merges into Google's public packetdrill repository the
majority of the packetdrill tool changes made at Google in the period
2013-2018 (after the initial open source release of packetdrill).

Major features added in this commit include:

+ support for testing:
   + cmsg data
   + TCP send timestamping
   + TCP timestamping opt stats (TCP_NLA_BUSY and friends)
   + TCP zero-copy (e.g. see --send_omit_free)
   + TCP MD5 options
   + TCP urgent pointer field
   + experimental and RFC-compliant TCP Fast Open options
   + ICMP sockets
   + the IPv4 or IPv6 TOS field
   + IPv6 flow labels
   + in IPv6-only environments
+ wider system call support:
   + epoll system calls (epoll_create(), epoll_ctl(), epoll_wait())
   + pipe()
   + splice()
   + cap_set()
+ optional final clean-up commands for destructor-like tear-down
   commands that are basically always executed at termination, whether
   scripts fail or succeed
+ improved Python support:
   + exporting symbolic names for tcpi_state values
   + exporting recent additions to Linux struct tcp_info
   + exporting TCP_CC_INFO for Vegas, DCTCP, BBR
   + exporting SO_MEMINFO
+ the ability to test shared libraries that support the sockets API,
   rather than just the kernel sockets API (see packetdrill.h)
+ preprocessor-style symbol definitions, e.g. -Dfoo=bar
+ support for random local IP addresses

Willem de Bruijn spearheaded this effort to upstream this batch of
changes, and put in a huge amount of work to make this happen. I would
like to thank him for all his work on this.

I would also like to thank the following Googlers for their
contributions over the years to the packetdrill code base, which are
reflected in this patch:

   Wei Wang
   Maciej Żenczykowski
   Yuchung Cheng
   Eric Dumazet
   Soheil Hassas Yeganeh
   Dimitris Michailidis
   Willem de Bruijn
   Yaogong Wang
   Eric Salo
   Chonggang Li
   Priyaranjan Jha
   Andreas Terzis
   Xiao Jia
   Mike Maloney
   Barath Raghavan
   Yousuk Seung
   Nandita Dukkipati
   Michael Davidson
   Hsiao-keng Jerry Chu
   Greg Thelen
   Chema Gonzalez
   Luigi Rizzo
   Kevin Athey
   Jeff Grafton
   Francis Y. Yan
   Fabien Duchene
   Bill Sommerfeld
   Anatol Pomazau

This commit has been verified to build cleanly with the default gcc
compiler on the following Linux distributions:

   Debian 8
   Debian 9
   Red Hat Enterprise Linux 7.4
   Ubuntu 14.04
   Ubuntu 17.10

This commit has not been tested on or ported to any BSD variants, due
to lack of time among members of our team. We are happy to accept
patches to get it to compile/run on popular BSD variants.


Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-04-04 Thread Neal Cardwell
On Wed, Apr 4, 2018 at 1:41 PM Yuchung Cheng <ych...@google.com> wrote:

> On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell <ncardw...@google.com>
wrote:
> > On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ych...@google.com> wrote:
> >> Agreed. That's a good point. And I would much preferred to rename that
> >> to FLAG_ORIG_PROGRESS (w/ updated comment).
> >
> >> so I think we're in agreement to use existing patch w/ the new name
> >> FLAG_ORIG_PROGRESS
> >
> > Yes, SGTM.
> >
> > I guess this "prevent bogus FRTO undos" patch would go to "net" branch
and
> > the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
> > branch?
> huh? why not one patch ... this is getting close to patch-split paralysis.

The flag rename seemed like a cosmetic issue that was not needed for the
fix. Smelled like net-next to me. But I don't feel strongly. However you
guys want to package it is fine with me. :-)

neal


Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-04-04 Thread Neal Cardwell
On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng  wrote:
> Agreed. That's a good point. And I would much preferred to rename that
> to FLAG_ORIG_PROGRESS (w/ updated comment).

> so I think we're in agreement to use existing patch w/ the new name
> FLAG_ORIG_PROGRESS

Yes, SGTM.

I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
branch?

neal


Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-04-04 Thread Neal Cardwell
On Wed, Apr 4, 2018 at 6:42 AM Ilpo Järvinen 
wrote:

> On Wed, 28 Mar 2018, Yuchung Cheng wrote:
...
> > Your patch looks good. Some questions / comments:

The patch looks good to me as well (I read the Mar 28 version). Thanks for
this fix, Ilpo.

> Just to be sure that we understand each other the correct way around this
> time, I guess it resolved both of your "disagree" points above?

> > 1. Why only apply to CA_Loss and not also CA_Recovery? this may
> > mitigate GRO issue I noted in other thread.

> Hmm, that's indeed a good idea. I'll give it some more thought but my
> initial impression is that it should work.

Great. I would agree with Yuchung's suggestion to apply this fix to
CA_Recovery as well.

> > 2. minor style nit: can we adjust tp->delivered directly in
> > tcp_clean_rtx_queue(). I am personally against calling "non-sack" reno
> > (e.g. tcp_*reno*()) because its really confusing w/ Reno congestion
> > control when SACK is used. One day I'd like to rename all these *reno*
> > to _nonsack_.

> That's what I actually did first but ended up putting it into its own
> function because of line lengths (not a particularly good reason).

> ...So yes, I can put it into the tcp_clean_rtx_queue in the next version.

> I'll also try to keep that "reno" thing in my mind.

Either approach here sounds fine to me. Though personally I find it more
readable to keep all the non-SACK helpers together and consistently named,
including this one (even if the name is confusing, at least it is
consistent... :-).

neal


Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-04-04 Thread Neal Cardwell
On Wed, Apr 4, 2018 at 6:35 AM Ilpo Järvinen 
wrote:

> On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >  wrote:
> > >
> > > If SACK is not enabled and the first cumulative ACK after the RTO
> > > retransmission covers more than the retransmitted skb, a spurious
> > > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > > The reason is that any non-retransmitted segment acknowledged will
> > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > > no indication that it would have been delivered for real (the
> > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > > case so the check for that bit won't help like it does with SACK).
> > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > > in tcp_process_loss.
> > >
> > > We need to use more strict condition for non-SACK case and check
> > > that none of the cumulatively ACKed segments were retransmitted
> > > to prove that progress is due to original transmissions. Only then
> > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > > non-SACK case.
> > >
> > > Signed-off-by: Ilpo Järvinen 
> > > ---
> > >  net/ipv4/tcp_input.c | 9 +
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 4a26c09..c60745c 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock
*sk, u32 prior_fack,
> > > pkts_acked = rexmit_acked +
newdata_acked;
> > >
> > > tcp_remove_reno_sacks(sk, pkts_acked);
> > > +
> > > +   /* If any of the cumulatively ACKed segments
was
> > > +* retransmitted, non-SACK case cannot
confirm that
> > > +* progress was due to original transmission
due to
> > > +* lack of TCPCB_SACKED_ACKED bits even if
some of
> > > +* the packets may have been never
retransmitted.
> > > +*/
> > > +   if (flag & FLAG_RETRANS_DATA_ACKED)
> > > +   flag &= ~FLAG_ORIG_SACK_ACKED;

FWIW I'd vote for this version.

> Of course I could put the back there but I really like the new place more
> (which was a result of your suggestion to place the code elsewhere).
> IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we
> weren't successful in proving (there in tcp_clean_rtx_queue) that progress
> was due original transmission and thus I would not want falsely indicate
> it with that flag. And there's the non-SACK related block anyway already
> there so it keeps the non-SACK "pollution" off from the SACK code paths.

I think that's a compelling argument. In particular, in terms of long-term
maintenance it seems risky to allow an unsound/buggy FLAG_ORIG_SACK_ACKED
bit be returned by tcp_clean_rtx_queue(). If we return an
incorrect/imcomplete FLAG_ORIG_SACK_ACKED bit then I worry that one day we
will forget that for non-SACK flows that bit is incorrect/imcomplete, and
we will add code using that bit but forgetting to check (tcp_is_sack(tp) ||
!FLAG_RETRANS_DATA_ACKED).

> (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to
> FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition
> we're after regardless of SACK and less ambiguous in non-SACK case).

I'm neutral on this. Not sure if the extra clarity is worth the code churn.

cheers,
neal


Re: SO_TCP_NODELAY implementation in TCP stack

2018-04-02 Thread Neal Cardwell
On Sun, Apr 1, 2018 at 4:06 AM Naruto Nguyen 
wrote:

> Hello everyone,

> As I know we have a socket option SO_TCP_NODELAY to disable Nagle
> Algorithm, and I found it is implemented in TCP/IP stack at
> https://elixir.bootlin.com/linux/v4.4.90/source/net/ipv4/tcp.c#L2401 .
> However, I do not know where the source code the Nagle Algorithm is
> implemented in kernel. If you know, could you please help me?

The Linux TCP code for Nagle's algorithm is largely in tcp_output.c
in the following functions:

   tcp_nagle_check(), tcp_minshall_check(), tcp_minshall_update()

The variant of the Nagle algorithm that Linux implements is described here:

   https://tools.ietf.org/html/draft-minshall-nagle-00

Basically, the algorithm avoids having more than one outstanding
unacknowledged packet that is less than one MSS in size.
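
A rough sketch of the core check (paraphrased from tcp_output.c; the
in-tree code differs in details and in the surrounding logic):

    /* Minshall variant: defer sending a new sub-MSS segment only while
     * an earlier sub-MSS segment (ending at snd_sml) is still unacked.
     */
    static bool tcp_minshall_check(const struct tcp_sock *tp)
    {
            return after(tp->snd_sml, tp->snd_una) &&
                   !after(tp->snd_sml, tp->snd_nxt);
    }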

The Nagle algorithm is disabled using the TCP_NODELAY socket option.
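
From the application side, that looks like the usual setsockopt() call
(note the option is TCP_NODELAY at the IPPROTO_TCP level; there is no
SO_TCP_NODELAY):

    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));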

neal


Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-07 Thread Neal Cardwell
On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen  wrote:
> A bogus undo may/will trigger when the loss recovery state is
> kept until snd_una is above high_seq. If tcp_any_retrans_done
> is zero, retrans_stamp is cleared in this transient state. On
> the next ACK, tcp_try_undo_recovery again executes and
> tcp_may_undo will always return true because tcp_packet_delayed
> has this condition:
> return !tp->retrans_stamp || ...
>
> Check for the false fast retransmit transient condition in
> tcp_packet_delayed to avoid bogus undos. Since snd_una may have
> advanced on this ACK but CA state still remains unchanged,
> prior_snd_una needs to be passed instead of tp->snd_una.

This one also seems like a case where it would be nice to have a
specific packet-by-packet example, or trace, or packetdrill scenario.
Something that we might be able to translate into a test, or at least
to document the issue more explicitly.

Thanks!
neal


Re: [PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-07 Thread Neal Cardwell
On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen  wrote:
>
> In a non-SACK case, any non-retransmitted segment acknowledged will
> set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> no indication that it would have been delivered for real (the
> scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> case). This causes bogus undos in ordinary RTO recoveries where
> segments are lost here and there, with a few delivered segments in
> between losses. A cumulative ACKs will cover retransmitted ones at
> the bottom and the non-retransmitted ones following that causing
> FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and results
> in a spurious FRTO undo.
>
> We need to make the check more strict for non-SACK case and check
> that none of the cumulatively ACKed segments were retransmitted,
> which would be the case for the last step of FRTO algorithm as we
> sent out only new segments previously. Only then, allow FRTO undo
> to proceed in non-SACK case.

Hi Ilpo - Do you have a packet trace or (even better) packetdrill
script illustrating this issue? It would be nice to have a test case
or at least concrete example of this.

Thanks!
neal


Re: [PATCH net-next 2/2] tcp_bbr: remove bbr->tso_segs_goal

2018-02-28 Thread Neal Cardwell
On Wed, Feb 28, 2018 at 5:40 PM, Eric Dumazet <eduma...@google.com> wrote:
> Its value is computed then immediately used,
> there is no need to store it.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

This one also looks great to me. Thanks, Eric!

neal


Re: [PATCH net-next 1/2] tcp_bbr: better deal with suboptimal GSO (II)

2018-02-28 Thread Neal Cardwell
On Wed, Feb 28, 2018 at 5:40 PM, Eric Dumazet <eduma...@google.com> wrote:
>
> This is second part of dealing with suboptimal device gso parameters.
> In first patch (350c9f484bde "tcp_bbr: better deal with suboptimal GSO")
> we dealt with devices having low gso_max_segs
>
> Some devices lower gso_max_size from 64KB to 16 KB (r8152 is an example)
>
> In order to probe an optimal cwnd, we want BBR being not sensitive
> to whatever GSO constraint a device can have.
>
> This patch removes tso_segs_goal() CC callback in favor of
> min_tso_segs() for CC wanting to override sysctl_tcp_min_tso_segs
>
> Next patch will remove bbr->tso_segs_goal since it does not have
> to be persistent.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Looks great to me. Thanks, Eric!

neal


Re: [PATCH net] tcp: restrict F-RTO to work-around broken middle-boxes

2018-02-21 Thread Neal Cardwell
On Wed, Feb 21, 2018 at 7:38 AM, Teodor Milkov  wrote:
> Here they are:
>
>
>  http://vps3.avodz.org/tmp/frto-4.14.11-linux.pcap.xz - 3.3 MB/s
>
>  http://vps3.avodz.org/tmp/frto-4.14.11-windows.pcap.xz - connection
> completely froze and eventually timed out
>
>  http://vps3.avodz.org/tmp/frto-4.14.20+revert-windows.pcap.xz - 5+ MB/s,
> which almost saturated the link
>

Thanks for the detailed traces! This is hugely helpful, and nails down
what is happening here.

As the first screen shot shows, an excerpt from your
frto-4.14.11-windows.pcap.xz trace (windows receiver suffers stall),
there is a painful interaction between:

(a) A very broken middlebox in the path of this traffic that is
stripping *all* SACK options, so that receivers advertise SACK
capability but are thereafter unable to communicate to the sender all
the packets they have received.

(b) The F-RTO change you mention above: 89fe18e44 ("tcp: extend F-RTO
to catch more spurious timeouts"), which causes more undo operations,
which are implicitly optimized for the case where more packets will be
SACKed, which does not happen because of (a), so that there are
repeated RTO timeouts.

(c) A Windows receiver that does not implement TCP timestamps. This
means that, per the TCP standard, the sender is supposed to keep
exponentially backing off for each of these RTOs.

The combination of these 3 factors causes very long stalls.

But please note that even with this F-RTO patch fully reverted, the
middlebox that drops SACKs is causing horrendous and unnecessarily
slow recoveries (see the second screen shot, from your
frto-4.14.20+revert-windows.pcap.xz trace). It would be nice to report
this SACK-stripping issue to the middlebox vendor, if possible. Or
maybe there is a config option that can disable this feature.

It seems we will indeed need to revert 89fe18e44. We have a Google TCP
team member out of the office on vacation. When he's back we'll
consult and follow up with our consensus.

Thanks again for the report and the traces! This was hugely helpful.

neal


Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Neal Cardwell
On Wed, Feb 21, 2018 at 9:43 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
> From: Eric Dumazet <eduma...@google.com>
>
> BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> burst sizes and to adjust cwnd in bbr_target_cwnd() with following
> gold formula :
>
> /* Allow enough full-sized skbs in flight to utilize end systems. */
> cwnd += 3 * bbr->tso_segs_goal;
>
> But GSO can be lacking or be constrained to very small
> units (ip link set dev ... gso_max_segs 2)
>
> What we really want is to have enough packets in flight so that both
> GSO and GRO are efficient.
>
> So in the case GSO is off or downgraded, we still want to have the same
> number of packets in flight as if GSO/TSO was fully operational, so
> that GRO can hopefully be working efficiently.
>
> To fix this issue, we make tcp_tso_autosize() unaware of
> sk->sk_gso_max_segs
>
> Only tcp_tso_segs() has to enforce the gso_max_segs limit.
>
> Tested:
>
> ethtool -K eth0 tso off gso off
> tc qd replace dev eth0 root pfifo_fast
>
> Before patch:
> for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> 691  (ss -temoi shows cwnd is stuck around 6 )
> 667
> 651
> 631
> 517
>
> After patch :
> # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
>1733 (ss -temoi shows cwnd is around 386 )
>1778
>1746
>1781
>1718
>
> Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
> ---
>  net/ipv4/tcp_output.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net] tcp: restrict F-RTO to work-around broken middle-boxes

2018-02-19 Thread Neal Cardwell
On Mon, Feb 19, 2018 at 11:17 AM, Teodor Milkov <t...@del.bg> wrote:
> On 19.02.2018 15:38, Neal Cardwell wrote:
>>
>> On Sun, Feb 18, 2018 at 4:02 PM, Teodor Milkov <t...@del.bg> wrote:
>>>
>>> Hello,
>>>
>>> I've numerous reports from Windows users that after kernel upgrade from
>>> 4.9
>>> to 4.14 they experienced major slow downs and transfer stalls.
>>>
>>> After some digging, I found that the slowness starts with this commit:
>>>
>>>   tcp: extend F-RTO to catch more spurious timeouts (89fe18e44)
>>>
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89fe18e44f7ee5ab1c90d0dff5835acee7751427
>>>
>>> Which is partially reverted later with this one:
>>>
>>>   tcp: restrict F-RTO to work-around broken middle-boxes (cc663f4d4)
>>>
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc663f4d4c97b7297fb45135ab23cfd508b35a77
>>>
>>> But, still, we had stalls until I fully reverted 89fe18e44.
>>
>> Thanks for the report. Do you have any other details that might help
>> evaluate this issue?
>
>
> I'm sorry I didn't provide more info. It was long session.
>
>> Any packet traces, by any chance?
>
>
> I'll try and obtain one.

Great! Yes, if you could obtain a sender-side tcpdump that captured
one of these slow-down/stalls, that would be fantastic. It would be
great to be able to understand what's going on.

We only need headers, so something like the following would be fine:

   tcpdump -c200 -w /tmp/test.pcap -s 120 -i $ETH_DEVICE port $PORT

Then if you could post on a web server or Google drive, etc, that'd be great.

Thanks for all the additional details!

neal


Re: [PATCH net] tcp: restrict F-RTO to work-around broken middle-boxes

2018-02-19 Thread Neal Cardwell
On Sun, Feb 18, 2018 at 4:02 PM, Teodor Milkov  wrote:
> Hello,
>
> I've numerous reports from Windows users that after kernel upgrade from 4.9
> to 4.14 they experienced major slow downs and transfer stalls.
>
> After some digging, I found that the slowness starts with this commit:
>
>  tcp: extend F-RTO to catch more spurious timeouts (89fe18e44)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89fe18e44f7ee5ab1c90d0dff5835acee7751427
>
> Which is partially reverted later with this one:
>
>  tcp: restrict F-RTO to work-around broken middle-boxes (cc663f4d4)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc663f4d4c97b7297fb45135ab23cfd508b35a77
>
> But, still, we had stalls until I fully reverted 89fe18e44.

Thanks for the report. Do you have any other details that might help
evaluate this issue? Any packet traces, by any chance? Were the
affected connections web browsing, videos, file transfer, etc? Were
there non-Windows users in this population that did not seem to be
affected by the stalls? Was the bottleneck primarily Ethernet, wifi,
cellular, cable modem, etc? Any middleboxes (firewall, NAT, etc)
between the servers and users? Does "stall" mean that the connection
permanently froze, or temporarily slowed down but eventually
recovered?

Thanks!

neal


Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Neal Cardwell
On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
 wrote:
>
> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Please note that there's no general rule about whether BBR will run
with a lower or higher cwnd than CUBIC, Reno, or other loss-based
congestion control algorithms. Whether BBR's cwnd will be lower or
higher depends on the BDP of the path, the amount of buffering in the
bottleneck, and the number of flows. BBR tries to match the amount of
in-flight data to the BDP based on the available bandwidth and the
two-way propagation delay. This will usually produce an amount of data
in flight that is smaller than CUBIC/Reno (yielding lower latency) if
the path has deep buffers (bufferbloat), but can be larger than
CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
the traffic is suffering burst losses.
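
(As a rough conceptual sketch of "match the amount of in-flight data to
the BDP" -- pseudo-code, not the exact bbr_target_cwnd() code in
net/ipv4/tcp_bbr.c:)

    /* Conceptual sketch: BBR's cwnd target is the estimated BDP scaled
     * by a gain, plus a small allowance so TSO/GRO aggregation and
     * delayed ACKs do not starve the pipe.
     */
    bdp  = max_bw_estimate * min_rtt_estimate;   /* bandwidth-delay product */
    cwnd = gain * bdp + 3 * tso_segs_goal;       /* aggregation allowance */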

>
>>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>>> transferring some files between hosts).
>
> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
>
> Connecting to host tux, port 5201
> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
> [...]
>
> Running it locally gives the more or less expected results as well:
>
> Connecting to host ragnarok, port 5201
> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes
> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes
> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes
> [...]
>
> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).

Can you please clarify if this is over bare metal or between VM
guests? It sounds like Oleksandr's initial report was between KVM VMs,
so the virtualization may be an ingredient here.

thanks,
neal


Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-16 Thread Neal Cardwell
On Fri, Feb 16, 2018 at 11:43 AM, Eric Dumazet <eduma...@google.com> wrote:
>
> On Fri, Feb 16, 2018 at 8:33 AM, Neal Cardwell <ncardw...@google.com> wrote:
> > Oleksandr,
> >
> > Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> > have not run into this one in our team, but we will try to work with you to
> > fix this.
> >
> > Would you be able to take a sender-side tcpdump trace of the slow BBR
> > transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would be
> > fine. Maybe something like:
> >
> >   tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT
> >
> > Thanks!
> > neal
>
> On baremetal and using latest net tree, I get pretty normal results at
> least, on 40Gbit NIC,

Eric raises a good question: bare metal vs VMs.

Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your
second e-mail did not seem to mention if those results were for bare
metal or a VM scenario: can you please clarify the details on your
second set of tests?

Thanks!


[PATCH net] tcp_bbr: fix pacing_gain to always be unity when using lt_bw

2018-01-31 Thread Neal Cardwell
This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when
using lt_bw and returning from the PROBE_RTT state to PROBE_BW.

Previously, when using lt_bw, upon exiting PROBE_RTT and entering
PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly
end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase()
set a pacing gain above 1.0. In such cases this would result in a
pacing rate that is 1.25x higher than intended, potentially resulting
in a high loss rate for a little while until we stop using the lt_bw a
bit later.

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Reported-by: Beyers Cronje <bcro...@gmail.com>
---
 net/ipv4/tcp_bbr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 8322f26e770e..25c5a0b60cfc 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -481,7 +481,8 @@ static void bbr_advance_cycle_phase(struct sock *sk)
 
bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
bbr->cycle_mstamp = tp->delivered_mstamp;
-   bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
+   bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT :
+   bbr_pacing_gain[bbr->cycle_idx];
 }
 
 /* Gain cycling: cycle pacing gain to converge to fair share of available bw. 
*/
@@ -490,8 +491,7 @@ static void bbr_update_cycle_phase(struct sock *sk,
 {
struct bbr *bbr = inet_csk_ca(sk);
 
-   if ((bbr->mode == BBR_PROBE_BW) && !bbr->lt_use_bw &&
-   bbr_is_next_cycle_phase(sk, rs))
+   if (bbr->mode == BBR_PROBE_BW && bbr_is_next_cycle_phase(sk, rs))
bbr_advance_cycle_phase(sk);
 }
 
-- 
2.16.0.rc1.238.g530d649a79-goog



Re: [PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Neal Cardwell
On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo  wrote:
> On 1/23/18, 9:30 AM, "Yuchung Cheng"  wrote:
>
> The original patch that changes TCP's congestion control via eBPF only
> re-initializes the new congestion control, if the BPF op is set to an
> (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP will
>
> What do you mean by “(invalid) value”?
>
> run the new congestion control from random states.
>
> This has always been possible with setsockopt, no?
>
>This patch fixes
> the issue by always re-init the congestion control like other means
> such as setsockopt and sysctl changes.
>
> The current code re-inits the congestion control when calling
> tcp_set_congestion_control except when it is called early on (i.e. op <=
> BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to re-initialize
> since it will be initialized later by TCP when the connection is established.
>
> Otherwise, if we always call tcp_reinit_congestion_control we would call
> tcp_cleanup_congestion_control when the congestion control has not been
> initialized yet.

Interesting. So I wonder if the symptoms we were seeing were due to
kernels that did not yet have this fix:

  27204aaa9dc6 ("tcp: uniform the set up of sockets after successful
connection):
  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=27204aaa9dc67b833b77179fdac556288ec3a4bf

Before that fix, there could be TFO passive connections that at SYN time called:
  tcp_init_congestion_control(child);
and then:
  tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);

So that if the CC was switched in the
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB handler then there would be no
init for the new module?
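
In other words, the pre-fix ordering described above looked roughly like
this (a sketch of the shape, not a literal quote of the old code):

	/* pre-27204aaa9dc6 TFO passive-open shape (sketch only): */
	tcp_init_congestion_control(child);	/* initializes CC #1 (e.g. cubic) */
	tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
						/* handler may switch to CC #2,
						 * which then runs without init */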

neal


Re: Linux ECN Handling

2018-01-03 Thread Neal Cardwell
On Tue, Jan 2, 2018 at 6:57 PM, Steve Ibanez  wrote:
> Hi Neal,
>
> Sorry, my last email was incorrect. It turns out the default tcp
> congestion control alg that was being used on my client machines was
> cubic instead of dctcp. That is why tp->processing_cwr field was never
> set in the tcp_rcv_established function. I've changed the default back
> to dctcp on all of my machines.
>
> I am now logging the value of tp->rcv_nxt at the top of the
> tcp_transmit_skb() function for all CWR segments. I see that during
> normal operation, the value of tp->rcv_nxt is equal to the SeqNo in
> the CWR segment  + length of the CWR segment.

OK, thanks. That makes sense.

This part I didn't understand:

> However, for the unACKed
> CWR segment, the value of tp->rcv_nxt is just equal to the SeqNo in
> the CWR segment (i.e. not incremented by the length). And I see that
> by the time the tcp_ack_snd_check() function is executed, tp->rcv_nxt
> has been incremented by the length of the unACKed CWR segment.

I would have thought that for the processing of the skb that has the
CWR, the sequence would be:

(1)  "...the tcp_ack_snd_check() function is executed, tp->rcv_nxt has
been incremented by the length of the unACKed CWR segment"

(2) then we send the ACK, and the instrumentation at the top of the
tcp_transmit_skb() function logs that rcv_nxt value (which "has been
incremented by the length of the unACKed CWR segment").

But you are saying "for the unACKed CWR segment, the value of
tp->rcv_nxt is just equal to the SeqNo in the CWR segment (i.e. not
incremented by the length)", which does not seem to match my
prediction in (2). Apparently I am mis-understanding the sequence.
Perhaps you can help clear it up for me? :-)

Is it possible that the case where you see "tp->rcv_nxt is just equal
to the SeqNo in the CWR segment" is a log line that was logged while
processing the skb that precedes the skb with the CWR?

> The tcp_transmit_skb() function sets the outgoing segment's ack_seq to
> be tp->rcv_next:
>
> th->ack_seq = htonl(tp->rcv_nxt);
>
> So I think the rcv_nxt field is supposed to be incremented before
> reaching tcp_transmit_skb(). Can you see any reason as to why this
> field would not be incremented for CWR segments sometimes?

No, so far I haven't been able to think of a reason why rcv_nxt would
not be incremented for in-order CWR-marked segments...

cheers,
neal


Re: Linux ECN Handling

2018-01-02 Thread Neal Cardwell
On Tue, Jan 2, 2018 at 2:43 AM, Steve Ibanez  wrote:
> Hi Neal,
>
> Apologies for the delay, and happy new year!
>
> To answer your question, data is only transferred in one direction
> (from the client to the server). The SeqNo in the pkts from the server
> to the client is not incremented. So I don't think that a data pkt is
> attempted to be sent in the tcp_data_snd_check() call clearing the
> ICSK_ACK_SCHED bit. Although I think it would be helpful to include
> your new debugging field in the tcp_sock (tp->processing_cwr) so that
> I can check this field in the tcp_transmit_skb() and tcp_send_ack()
> functions. I added the new field and tried to set it at the top of the
> tcp_rcv_established(), but then when I try to check the field in the
> tcp_send_ack() function it never appears to be set. Below I'm showing
> how I set the tp->processing_cwr field in the tcp_rcv_established
> function and how I check it in the tcp_send_ack function. Is this how
> you were imagining the processing_cwr field to be used?

Happy new year to you as well, and thank you, Steve, for running this
experiment! Yes, this is basically the kind of thing I had in mind.

The connection will run the "fast path" tcp_rcv_established() code if
the connection is in the ESTABLISHED state.  From the symptoms it
sounds like what's happening is that in this test the connection is
not in the ESTABLISHED state when the CWR arrives, so it's probably
running the more general tcp_rcv_state_process() function. I would
suggest adding your tp->processing_cwr instrumentation at the top of
tcp_rcv_state_process() as well, and then re-running the test. (In
tcp_v4_do_rcv() and tcp_v6_do_rcv(), for each incoming skb one of
those two functions is called).
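
Concretely, the instrumentation at the top of tcp_rcv_state_process()
could be something like this (debugging sketch only; tp->processing_cwr
is your experimental field, not an upstream one):

	/* debugging sketch: flag CWR-marked skbs on the non-fast-path too */
	if (sk->sk_family == AF_INET && tcp_hdr(skb)->cwr)
		tcp_sk(sk)->processing_cwr = 1;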

It is interesting that the connection does not seem to be in the
ESTABLISHED state. Maybe that is an ingredient of the unexpected
behavior in this case...

Thanks!
neal


Re: Linux ECN Handling

2017-12-20 Thread Neal Cardwell
On Wed, Dec 20, 2017 at 2:20 PM, Steve Ibanez  wrote:
>
> Hi Neal,
>
> I added in some more printk statements and it does indeed look like
> all of these calls you listed are being invoked successfully. I guess
> this isn't too surprising given what the inet_csk_schedule_ack() and
> inet_csk_ack_scheduled() functions are doing:
>
> static inline void inet_csk_schedule_ack(struct sock *sk)
> {
> inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_SCHED;
> }
>
> static inline int inet_csk_ack_scheduled(const struct sock *sk)
> {
> return inet_csk(sk)->icsk_ack.pending & ICSK_ACK_SCHED;
> }
>
> So through the code path that you listed, the inet_csk_schedule_ack()
> function sets the ICSK_ACK_SCHED bit and then the tcp_ack_snd_check()
> function just checks that the ICSK_ACK_SCHED bit is indeed set.
> Do you know how I can verify that setting the ICSK_ACK_SCHED bit
> actually results in an ACK being sent?

Hmm. I don't think in this case we can verify that setting the
ICSK_ACK_SCHED bit actually results in an ACK being sent. Because
AFAICT in this case it seems like an ACK is not sent. :-) This is
based on both the tcpdumps on Dec 5 and your detective work yesterday
("The tcp_rcv_established() function calls tcp_ack_snd_check() at the
end of step5 and then the return statement indicated below is invoked,
which prevents the __tcp_ack_snd_check() function from running.")

So AFAICT the puzzle is: how is the icsk_ack.pending  ICSK_ACK_SCHED
bit being cleared between the inet_csk_schedule_ack() call and the
tcp_ack_snd_check() call, without (apparently) an actual ACK being
sent on the wire?

AFAICT the ICSK_ACK_SCHED bit is not supposed to be cleared unless we
get to this sequence:

tcp_transmit_skb()
  if (likely(tcb->tcp_flags & TCPHDR_ACK))
tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
 -> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
icsk->icsk_ack.blocked = icsk->icsk_ack.pending = 0;

I don't have a theory that fits all of those data points, unless this
is a bi-directional transfer (is it?) and between the
inet_csk_schedule_ack() call and the tcp_ack_snd_check() call the TCP
connection sends a data packet (in tcp_data_snd_check()) and then it
is dropped for some reason before the packet makes it to the tcpdump
sniffing point. Perhaps because of a qdisc limit or something?

I guess a possible next step would be, while processing an incoming
skb with the cwr bit set, the code could set a new debugging field in
the tcp_sock (tp->processing_cwr), and then you could check this field
in tcp_transmit_skb() and printk if (1) there is an attempted
queue_xmit() call and (2) if the queue_xmit() fails (the err > 0 case).
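
Something like this around the queue_xmit() call near the end of
tcp_transmit_skb() (debugging sketch only; tp->processing_cwr is the
hypothetical field above, and the exact placement depends on your tree):

	if (tp->processing_cwr)
		pr_info("%s: attempting xmit while processing CWR skb\n", __func__);

	/* ... the existing err = icsk->icsk_af_ops->queue_xmit(...) call ... */

	if (tp->processing_cwr && err > 0)
		pr_info("%s: queue_xmit failed (err=%d) while processing CWR skb\n",
			__func__, err);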

That's a long shot, but the only idea I have at this point.

thanks,
neal


Re: Linux ECN Handling

2017-12-19 Thread Neal Cardwell
On Tue, Dec 19, 2017 at 5:00 PM, Steve Ibanez  wrote:
> Hi Neal,
>
> I managed to track down the code path that the unACKed CWR packet is
> taking. The tcp_rcv_established() function calls tcp_ack_snd_check()
> at the end of step5 and then the return statement indicated below is
> invoked, which prevents the __tcp_ack_snd_check() function from
> running.
>
> static inline void tcp_ack_snd_check(struct sock *sk)
> {
> if (!inet_csk_ack_scheduled(sk)) {
> /* We sent a data segment already. */
> return;   /* <=== here */
> }
> __tcp_ack_snd_check(sk, 1);
> }
>
> So somehow tcp_ack_snd_check() thinks that a data segment was already
> sent when in fact it wasn't. Do you see a way around this issue?

Thanks for tracking that down! AFAICT in this case the call chain we
are trying to achieve is as follows:

tcp_rcv_established()
 -> tcp_data_queue()
 -> tcp_event_data_recv()
 -> inet_csk_schedule_ack()

The only thing I can think of would be to add printks that fire for
CWR packets, to isolate why the code bails out before it reaches those
calls...

thanks,
neal


Re: Linux ECN Handling

2017-12-19 Thread Neal Cardwell
On Tue, Dec 19, 2017 at 12:16 AM, Steve Ibanez  wrote:
>
> Hi Neal,
>
> I started looking into this receiver ACKing issue today. Strangely,
> when I tried adding printk statements at the top of the
> tcp_v4_do_rcv(), tcp_rcv_established(), __tcp_ack_snd_check() and
> tcp_send_delayed_ack() functions they were never executed on the
> machine running the iperf3 server (i.e. the destination of the flows).
> Maybe the iperf3 server is using its own TCP stack?
>
> In any case, the ACKing problem is reproducible using just normal
> iperf for which I do see my printk statements being executed. I can
> now confirm that when the CWR marked packet (for which no ACK is sent)
> arrives at the receiver, the __tcp_ack_snd_check() function is never
> invoked; and hence neither is the tcp_send_delayed_ack() function.
> Hopefully this helps narrow down where the issue might be? I started
> adding some printk statements into the tcp_rcv_established() function,
> but I'm not sure where the best places to look would be so I wanted to
> ask your advice on this.

Thanks for the detailed report!

As a next step to narrow down why the CWR-marked packet is not acked,
I would suggest adding printk statements at the bottom of
tcp_rcv_established() in all the spots where we have a goto or return
that would cause us to short-circuit and not reach the
tcp_ack_snd_check() at the bottom of the function. This could be much
like your existing nice debugging printks, that log any time we get to
that spot and if (sk->sk_family == AF_INET && th->cwr). And these
could be in the following spots (marked "here"):

slow_path:
if (len < (th->doff << 2) || tcp_checksum_complete(skb))
goto csum_error;/* <=== here */

if (!th->ack && !th->rst && !th->syn)
goto discard;/* <=== here */

/*
 *  Standard slow path.
 */

if (!tcp_validate_incoming(sk, skb, th, 1))
return;/* <=== here */

step5:
if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0)
goto discard;/* <=== here */
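
For example, each marked spot could get a one-liner like this (debugging
sketch only):

	if (sk->sk_family == AF_INET && th->cwr)
		pr_info("CWR skb short-circuited at %s:%d\n", __func__, __LINE__);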


thanks,
neal


Re: [PATCH iproute2 1/1] ss: add missing path MTU parameter

2017-12-14 Thread Neal Cardwell
On Thu, Dec 14, 2017 at 2:23 PM, Roman Mashak  wrote:
>
> Signed-off-by: Roman Mashak 
> ---
...
> @@ -1967,6 +1968,8 @@ static void tcp_stats_print(struct tcpstat *s)
> printf(" cwnd:%u", s->cwnd);
> if (s->ssthresh)
> printf(" ssthresh:%d", s->ssthresh);
> +   if (s->pmtu)
> +   printf(" pmtu:%u", s->pmtu);

Would it be possible to print the pmtu immediately after the mss? IMHO
having related parameters next to each other this way would make this
easier to parse for humans.
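
i.e. something like this in tcp_stats_print() (a sketch; the mss line is
assumed to be the existing one, only the placement of the new pmtu line
changes):

	if (s->mss)
		printf(" mss:%d", s->mss);
	if (s->pmtu)
		printf(" pmtu:%u", s->pmtu);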

Thanks for adding this!

cheers,
neal


Re: [PATCH net] tcp: fix potential underestimation on rcv_rtt

2017-12-13 Thread Neal Cardwell
On Tue, Dec 12, 2017 at 7:28 PM, Wei Wang <wei...@google.com> wrote:
> From: Wei Wang <wei...@google.com>
>
> When ms timestamp is used, current logic uses 1us in
> tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
> This could cause rcv_rtt underestimation.
> Fix it by always using a min value of 1ms if ms timestamp is used.
>
> Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high
> resolution timestamps")
>
> Signed-off-by: Wei Wang <wei...@google.com>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks!

neal


Re: [PATCH net] tcp: refresh tcp_mstamp from timers callbacks

2017-12-12 Thread Neal Cardwell
On Tue, Dec 12, 2017 at 9:22 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> Only the retransmit timer currently refreshes tcp_mstamp
>
> We should do the same for delayed acks and keepalives.
>
> Even if RFC 7323 does not request it, this is consistent to what linux
> did in the past, when TS values were based on jiffies.
>
> Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Mike Maloney <malo...@google.com>
> Cc: Neal Cardwell <ncardw...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net-next 3/3] tcp: smoother receiver autotuning

2017-12-11 Thread Neal Cardwell
On Sun, Dec 10, 2017 at 8:55 PM, Eric Dumazet <eduma...@google.com> wrote:
> Back in linux-3.13 (commit b0983d3c9b13 ("tcp: fix dynamic right sizing"))
> I addressed the pressing issues we had with receiver autotuning.
>
> But DRS suffers from extra latencies caused by rcv_rtt_est.rtt_us
> drifts. One common problem happens during slow start, since the
> apparent RTT measured by the receiver can be inflated by ~50%,
> at the end of one packet train.
>
> Also, a single drop can delay read() calls by one RTT, meaning
> tcp_rcv_space_adjust() can be called one RTT too late.
>
> By replacing the tri-modal heuristic with a continuous function,
> we can offset the effects of not growing 'at the optimal time'.
>
> The curve of the function matches prior behavior if the space
> increased by 25% and 50% exactly.
>
> Cost of added multiply/divide is small, considering a TCP flow
> typically would run this part of the code few times in its life.
>
> I tested this patch with 100 ms RTT / 1% loss link, 100 runs
> of (netperf -l 5), and got an average throughput of 4600 Mbit
> instead of 1700 Mbit.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
> Acked-by: Wei Wang <wei...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net-next 2/3] tcp: avoid integer overflows in tcp_rcv_space_adjust()

2017-12-11 Thread Neal Cardwell
On Sun, Dec 10, 2017 at 8:55 PM, Eric Dumazet <eduma...@google.com> wrote:
> When using large tcp_rmem[2] values (I did tests with 500 MB),
> I noticed overflows while computing rcvwin.
>
> Lets fix this before the following patch.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
> Acked-by: Wei Wang <wei...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net-next 1/3] tcp: do not overshoot window_clamp in tcp_rcv_space_adjust()

2017-12-11 Thread Neal Cardwell
On Sun, Dec 10, 2017 at 8:55 PM, Eric Dumazet <eduma...@google.com> wrote:
>
> While rcvbuf is properly clamped by tcp_rmem[2], rcvwin
> is left to a potentially too big value.
>
> It has no serious effect, since :
> 1) tcp_grow_window() has very strict checks.
> 2) window_clamp can be mangled by user space to any value anyway.
>
> tcp_init_buffer_space() and companions use tcp_full_space(),
> we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
> Acked-by: Wei Wang <wei...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net 0/3] TCP BBR sampling fixes for loss recovery undo

2017-12-07 Thread Neal Cardwell
On Thu, Dec 7, 2017 at 12:43 PM, Neal Cardwell <ncardw...@google.com> wrote:
> This patch series has a few minor bug fixes for cases where spurious
> loss recoveries can trick BBR estimators into estimating that the
> available bandwidth is much lower than the true available bandwidth.
> In both cases the fix here is to just reset the estimator upon loss
> recovery undo.
>
> Neal Cardwell (3):
>   tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
>   tcp_bbr: reset full pipe detection on loss recovery undo
>   tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
>
>  net/ipv4/tcp_bbr.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)

Sorry, I should have mentioned these patches are all stable candidates
and should have all carried a footer of:

  Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")

David, I can resend a v2 with that footer on each patch if you want.

neal


[PATCH net 0/3] TCP BBR sampling fixes for loss recovery undo

2017-12-07 Thread Neal Cardwell
This patch series has a few minor bug fixes for cases where spurious
loss recoveries can trick BBR estimators into estimating that the
available bandwidth is much lower than the true available bandwidth.
In both cases the fix here is to just reset the estimator upon loss
recovery undo.

Neal Cardwell (3):
  tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
  tcp_bbr: reset full pipe detection on loss recovery undo
  tcp_bbr: reset long-term bandwidth sampling on loss recovery undo

 net/ipv4/tcp_bbr.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

-- 
2.15.1.424.g9478a66081-goog



[PATCH net 2/3] tcp_bbr: reset full pipe detection on loss recovery undo

2017-12-07 Thread Neal Cardwell
Fix BBR so that upon notification of a loss recovery undo BBR resets
the full pipe detection (STARTUP exit) state machine.

Under high reordering, reordering events can be interpreted as loss.
If the reordering and spurious loss estimates are high enough, this
could previously cause BBR to spuriously estimate that the pipe is
full.

Since spurious loss recovery means that our overall sending will have
slowed down spuriously, this commit gives a flow more time to probe
robustly for bandwidth and decide the pipe is really full.

Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Yuchung Cheng <ych...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
---
 net/ipv4/tcp_bbr.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 3089c956b9f9..ab3ff14ea7f7 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -874,6 +874,10 @@ static u32 bbr_sndbuf_expand(struct sock *sk)
  */
 static u32 bbr_undo_cwnd(struct sock *sk)
 {
+   struct bbr *bbr = inet_csk_ca(sk);
+
+   bbr->full_bw = 0;   /* spurious slow-down; reset full pipe detection */
+   bbr->full_bw_cnt = 0;
return tcp_sk(sk)->snd_cwnd;
 }
 
-- 
2.15.1.424.g9478a66081-goog



[PATCH net 3/3] tcp_bbr: reset long-term bandwidth sampling on loss recovery undo

2017-12-07 Thread Neal Cardwell
Fix BBR so that upon notification of a loss recovery undo BBR resets
long-term bandwidth sampling.

Under high reordering, reordering events can be interpreted as loss.
If the reordering and spurious loss estimates are high enough, this
can cause BBR to spuriously estimate that we are seeing loss rates
high enough to trigger long-term bandwidth estimation. To avoid that
problem, this commit resets long-term bandwidth sampling on loss
recovery undo events.

Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Yuchung Cheng <ych...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
---
 net/ipv4/tcp_bbr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index ab3ff14ea7f7..8322f26e770e 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -878,6 +878,7 @@ static u32 bbr_undo_cwnd(struct sock *sk)
 
bbr->full_bw = 0;   /* spurious slow-down; reset full pipe detection */
bbr->full_bw_cnt = 0;
+   bbr_reset_lt_bw_sampling(sk);
return tcp_sk(sk)->snd_cwnd;
 }
 
-- 
2.15.1.424.g9478a66081-goog



[PATCH net 1/3] tcp_bbr: record "full bw reached" decision in new full_bw_reached bit

2017-12-07 Thread Neal Cardwell
This commit records the "full bw reached" decision in a new
full_bw_reached bit. This is a pure refactor that does not change the
current behavior, but enables subsequent fixes and improvements.

In particular, this enables simple and clean fixes because the full_bw
and full_bw_cnt can be unconditionally zeroed without worrying about
forgetting that we estimated we filled the pipe in Startup. And it
enables future improvements because multiple code paths can be used
for estimating that we filled the pipe in Startup; any new code paths
only need to set this bit when they think the pipe is full.

Note that this fix intentionally reduces the width of the full_bw_cnt
counter, since we have never used the most significant bit.

Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Yuchung Cheng <ych...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
---
 net/ipv4/tcp_bbr.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 69ee877574d0..3089c956b9f9 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -110,7 +110,8 @@ struct bbr {
u32 lt_last_lost;/* LT intvl start: tp->lost */
u32 pacing_gain:10, /* current gain for setting pacing rate */
cwnd_gain:10,   /* current gain for setting cwnd */
-   full_bw_cnt:3,  /* number of rounds without large bw gains */
+   full_bw_reached:1,   /* reached full bw in Startup? */
+   full_bw_cnt:2,  /* number of rounds without large bw gains */
cycle_idx:3,/* current index in pacing_gain cycle array */
has_seen_rtt:1, /* have we seen an RTT sample yet? */
unused_b:5;
@@ -180,7 +181,7 @@ static bool bbr_full_bw_reached(const struct sock *sk)
 {
const struct bbr *bbr = inet_csk_ca(sk);
 
-   return bbr->full_bw_cnt >= bbr_full_bw_cnt;
+   return bbr->full_bw_reached;
 }
 
 /* Return the windowed max recent bandwidth sample, in pkts/uS << BW_SCALE. */
@@ -717,6 +718,7 @@ static void bbr_check_full_bw_reached(struct sock *sk,
return;
}
++bbr->full_bw_cnt;
+   bbr->full_bw_reached = bbr->full_bw_cnt >= bbr_full_bw_cnt;
 }
 
 /* If pipe is probably full, drain the queue and then enter steady-state. */
@@ -850,6 +852,7 @@ static void bbr_init(struct sock *sk)
bbr->restore_cwnd = 0;
bbr->round_start = 0;
bbr->idle_restart = 0;
+   bbr->full_bw_reached = 0;
bbr->full_bw = 0;
bbr->full_bw_cnt = 0;
bbr->cycle_mstamp = 0;
-- 
2.15.1.424.g9478a66081-goog



Re: [PATCH net] tcp: use current time in tcp_rcv_space_adjust()

2017-12-06 Thread Neal Cardwell
On Wed, Dec 6, 2017 at 2:08 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> When I switched rcv_rtt_est to high resolution timestamps, I forgot
> that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust()
>
> Using an old timestamp leads to autotuning lags.
>
> Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high 
> resolution timestamps")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Wei Wang <wei...@google.com>
> Cc: Neal Cardwell <ncardw...@google.com>
> Cc: Yuchung Cheng <ych...@google.com>
> ---

Acked-by: Neal Cardwell <ncardw...@google.com>

neal


Re: Linux ECN Handling

2017-12-05 Thread Neal Cardwell
On Tue, Dec 5, 2017 at 12:22 AM, Steve Ibanez  wrote:
> Hi Neal,
>
> Happy to help out :) And thanks for the tip!
>
> I was able to track down where the missing bytes that you pointed out
> are being lost. It turns out the destination host seems to be
> misbehaving. I performed a packet capture at the destination host
> interface (a snapshot of the trace is attached). I see the following
> sequence of events when a timeout occurs (note that I have NIC
> offloading enabled so wireshark captures packets larger than the MTU):
>
> 1. The destination receives a data packet of length X with seqNo = Y
> from the src with the CWR bit set and does not send back a
> corresponding ACK.
> 2. The source times out and sends a retransmission packet of length Z
> (where Z < X) with seqNo = Y
> 3. The destination sends back an ACK with AckNo = Y + X
>
> So in other words, the packet which the destination host does not
> initially ACK (causing the timeout) does not actually get lost because
> after receiving the retransmission the AckNo moves forward all the way
> past the bytes in the initial unACKed CWR packet. In the attached
> screenshot, I've marked the unACKed CWR packet with a red box.
>
> Have you seen this behavior before? And do you know what might be
> causing the destination host not to ACK the CWR packet? In most cases
> the CWR marked packets are ACKed properly, it's just occasionally they
> are not.

Thanks for the detailed report!

I have not heard of an incoming CWR causing the receiver to fail to
ACK. And in re-reading the code, I don't see an obvious way in which a
CWR bit should cause the receiver to fail to ACK.

That screen shot is a bit hard to parse. Would you be able to post a
tcpdump .pcap of that particular section, or post a screen shot of a
time-sequence plot of that section?

To extract that segment and take screen shot, you could use something like:

  editcap -A "2017-12-04 11:22:27" -B "2017-12-04 11:22:30" all.pcap slice.pcap
  tcptrace -S -xy -zy slice.pcap
  xplot.org a2b_tsg.xpl &
  # take screenshot

Or, alternatively, would you be able to post the slice.pcap on a web
server or public drive?

thanks,
neal


Re: Linux ECN Handling

2017-12-01 Thread Neal Cardwell
On Mon, Nov 27, 2017 at 1:49 PM, Steve Ibanez  wrote:
>
> Hi Neal,
>
> I tried out your new suggested patches and indeed it looks like it is
> working. The duration of the freezes looks like it has reduced from an
> RTO to 10ms (tcp probe reports SRTT measurements of about 200us just
> before the freeze). So the PTO value seems to be correctly set to
> max(2*SRTT, 10ms).

Great. Thank you for testing this!

Our team will look into testing these patches more and sending some
version of them upstream if things look good.

Also, BTW, in newer kernels, with bb4d991a28cc ("tcp: adjust tail loss
probe timeout") from July, the TLP timeout should be closer to 2*SRTT
+ 2 jiffies, so if your kernel has 1ms jiffies this should further
improve things.

Thanks,
neal


Re: Linux ECN Handling

2017-11-21 Thread Neal Cardwell
On Tue, Nov 21, 2017 at 10:02 PM, Steve Ibanez <siba...@stanford.edu> wrote:
> Hi Neal,
>
> I just tried out your fix for enabling TLPs in the CWR state (while
> leaving tcp_tso_should_defer() unchanged), but I'm still seeing the
> host enter long timeouts. Feel free to let me know if there is
> something else you'd like me to try.

Oh, interesting. That was surprising to me, until I re-read the TLP
code. I think the TLP code is accidentally preventing the TLP timer
from being set in cases where TSO deferral is using cwnd unused,
because of this part of the logic:

  if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
  !tcp_write_queue_empty(sk))
return false;

AFAICT it would be great for the TLP timer to be set if TSO deferral
decides not to send. That way the TLP timer firing can unwedge the
connection (in only a few milliseconds in LAN cases) if TSO deferral
decides to defer sending and ACKs stop arriving. Removing those 3 lines
might allow TLP to give us much of the benefit of having a timer to
unwedge things after TSO deferral, without adding any new timers or
code.

If you have time, would you be able to try the following two patches
together in your test set-up?

(1)
commit 1ade85cd788cfed0433a83da03e299f396769c73
Author: Neal Cardwell <ncardw...@google.com>
Date:   Tue Nov 21 22:33:30 2017 -0500

tcp: allow TLP in CWR

(Also allows TLP in disorder, though this is somewhat academic, since
in disorder RACK will almost always override the TLP timer with a
reorder timeout.)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ea79b2ad82e..deccf8070f84 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2536,11 +2536,11 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)

early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
/* Schedule a loss probe in 2*RTT for SACK capable connections
-* in Open state, that are either limited by cwnd or application.
+* not in loss recovery, that are either limited by cwnd or application.
 */
if ((early_retrans != 3 && early_retrans != 4) ||
!tp->packets_out || !tcp_is_sack(tp) ||
-   icsk->icsk_ca_state != TCP_CA_Open)
+   icsk->icsk_ca_state >= TCP_CA_Recovery)
return false;

if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&

(2)
commit ccd377a601c14dc82826720d93afb573a388022e (HEAD ->
gnetnext8xx-tlp-recalc-rto-on-ack-fix-v4)
Author: Neal Cardwell <ncardw...@google.com>
Date:   Tue Nov 21 22:34:42 2017 -0500

tcp: allow scheduling TLP timer if TSO deferral leaves some cwnd unused

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index deccf8070f84..1724cc2bbf1a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2543,10 +2543,6 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)
icsk->icsk_ca_state >= TCP_CA_Recovery)
return false;

-   if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
-!tcp_write_queue_empty(sk))
-   return false;
-
/* Probe timeout is 2*rtt. Add minimum RTO to account
 * for delayed ack when there's one outstanding packet. If no RTT
 * sample is available then probe after TCP_TIMEOUT_INIT.

Thanks!

neal


Re: Linux ECN Handling

2017-11-21 Thread Neal Cardwell
On Tue, Nov 21, 2017 at 10:51 AM, Yuchung Cheng <ych...@google.com> wrote:
> On Tue, Nov 21, 2017 at 7:01 AM, Neal Cardwell <ncardw...@google.com> wrote:
>>
>> The original motivation for only allowing TLP in the CA_Open state was
>> to be conservative and avoid having the TLP impose extra load on the
>> bottleneck when it may be congested. Plus if there are any SACKed
>> packets in the SACK scoreboard then there are other existing
>> mechanisms to do speedy loss recovery.
> Neal I like your idea of covering more states in TLP. but shouldn't we
> also fix the tso_deferral_logic to work better w/ PRR in CWR state, b/c
> it's a general transmission issue.

Yes, I agree it's also worthwhile to see if we can make PRR and TSO
deferral play well together. Sorry, I should have been more clear
about that.

neal


Re: Linux ECN Handling

2017-11-21 Thread Neal Cardwell
On Tue, Nov 21, 2017 at 12:58 AM, Steve Ibanez  wrote:
> Hi Neal,
>
> I tried your suggestion to disable tcp_tso_should_defer() and it does
> indeed look like it is preventing the host from entering timeouts.
> I'll have to do a bit more digging to try and find where the packets
> are being dropped. I've verified that the bottleneck link queue is
> capacity is at about the configured marking threshold when the timeout
> occurs, so the drops may be happening at the NIC interfaces or perhaps
> somewhere unexpected in the switch.

Great! Thanks for running that test.

> I wonder if you can explain why the TLP doesn't fire when in the CWR
> state? It seems like that might be worth having for cases like this.

The original motivation for only allowing TLP in the CA_Open state was
to be conservative and avoid having the TLP impose extra load on the
bottleneck when it may be congested. Plus if there are any SACKed
packets in the SACK scoreboard then there are other existing
mechanisms to do speedy loss recovery.

But at various times we have talked about expanding the set of
scenarios where TLP is used. And I think this example demonstrates
that there is a class of real-world cases where it probably makes
sense to allow TLP in the CWR state.

If you have time, would you be able to check if leaving
tcp_tso_should_defer () as-is but enabling TLP probes in CWR state
also fixes your performance issue? Perhaps something like
(uncompiled/untested):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ea79b2ad82e..deccf8070f84 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2536,11 +2536,11 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)

early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
/* Schedule a loss probe in 2*RTT for SACK capable connections
-* in Open state, that are either limited by cwnd or application.
+* not in loss recovery, that are either limited by cwnd or application.
 */
if ((early_retrans != 3 && early_retrans != 4) ||
!tp->packets_out || !tcp_is_sack(tp) ||
-   icsk->icsk_ca_state != TCP_CA_Open)
+   icsk->icsk_ca_state >= TCP_CA_Recovery)
return false;

if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&

> Btw, thank you very much for all the help! It is greatly appreciated :)

You are very welcome! :-)

cheers,
neal


Re: Linux ECN Handling

2017-11-20 Thread Neal Cardwell
On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez  wrote:
> Hi Folks,
>
> I wanted to check back in on this for another update and to solicit
> some more suggestions. I did a bit more digging to try an isolate the
> problem.

Going back to one of your Oct 19 trace snapshots (attached), AFAICT at
the time of the timeout there is actually almost 64KBytes  (352553398
+ 1448 - 352489686 = 65160) of unacknowledged data. So there really
does seem to be a significant chunk of packets that were in-flight
that were then declared lost.

So here is a possibility: perhaps the combination of CWR+PRR plus
tcp_tso_should_defer() means that PRR can make cwnd so gentle that
tcp_tso_should_defer() thinks we should wait for another ACK to send,
and that ACK doesn't come. Breaking it, down, the potential sequence
would be:

(1) tcp_write_xmit() does not send, because the CWR behavior, using
PRR, does not leave enough cwnd for tcp_tso_should_defer() to think we
should send (PRR was originally designed for recovery, which did not
have TSO deferral)

(2) TLP does not fire, because we are in state CWR, not Open

(3) The only remaining option is an RTO, which fires.

In other words, the possibility is that, at the time of the stall, the
cwnd is reasonably high, but tcp_packets_in_flight() is also quite
high, so either there is (a) literally no unused cwnd left (
tcp_packets_in_flight() == cwnd), or (b) some mechanism like
tcp_tso_should_defer() is deciding that there is not enough available
cwnd for it to make sense to chop off a fraction of a TSO skb to send
now.

One way to test that conjecture would be to disable
tcp_tso_should_defer() by adding a:

   goto send_now;

at the top of tcp_tso_should_defer().

If that doesn't prevent the freezes then I would recommend adding
printks or other instrumentation to tcp_write_xmit() to log (a rough
sketch follows the list below):

- time
- ca_state
- cwnd
- ssthresh
- tcp_packets_in_flight()
- the reason for breaking out of the tcp_write_xmit() loop (tso
deferral, no packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
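
A helper like this, called with a short reason string at each exit point
of tcp_write_xmit(), would cover the list above (debugging sketch only;
tcp_jiffies32 assumes a 4.13-or-newer tree):

static void tcp_log_xmit_break(struct sock *sk, const char *reason)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* debugging only: dump the state listed above at each break/return */
	pr_info("tcp_write_xmit: %s now=%u ca_state=%u cwnd=%u ssthresh=%u inflight=%u\n",
		reason, tcp_jiffies32, inet_csk(sk)->icsk_ca_state,
		tp->snd_cwnd, tp->snd_ssthresh, tcp_packets_in_flight(tp));
}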

cheers,
neal


[PATCH net] tcp: when scheduling TLP, time of RTO should account for current ACK

2017-11-17 Thread Neal Cardwell
Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e6e was:
  -> schedule a TLP for now + RTO_interval

The response after df92c8394e6e is:
  -> schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:

1) Near the top of tcp_ack(), switch from TLP timer to RTO
   at write_queue_head->packet_tx_time + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
   tcp_rearm_rto(sk);

2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
if (flag & FLAG_ACKED) {
   tcp_rearm_rto(sk);

3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
   to TLP at now + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
   tcp_schedule_loss_probe(sk);

In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.

Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data 
ACKed/SACKed")
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/tcp.h | 2 +-
 net/ipv4/tcp_input.c  | 2 +-
 net/ipv4/tcp_output.c | 8 +---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578195d4..4e09398009c1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -539,7 +539,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
-bool tcp_schedule_loss_probe(struct sock *sk);
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto);
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 const struct sk_buff *next_skb);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d392fb..f31de422b37f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2964,7 +2964,7 @@ void tcp_rearm_rto(struct sock *sk)
 /* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
 static void tcp_set_xmit_timer(struct sock *sk)
 {
-   if (!tcp_schedule_loss_probe(sk))
+   if (!tcp_schedule_loss_probe(sk, true))
tcp_rearm_rto(sk);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 540b7d92cc70..a4d214c7b506 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2391,7 +2391,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
 
/* Send one loss probe per tail loss episode. */
if (push_one != 2)
-   tcp_schedule_loss_probe(sk);
+   tcp_schedule_loss_probe(sk, false);
is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
tcp_cwnd_validate(sk, is_cwnd_limited);
return false;
@@ -2399,7 +2399,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
return !tp->packets_out && !tcp_write_queue_empty(sk);
 }
 
-bool tcp_schedule_loss_probe(struct sock *sk)
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
@@ -2440,7 +2440,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
}
 
/* If the RTO formula yields an earlier time, then use that time. */
-   rto_delta_us = tcp_rto_delta_us(sk);  /* How far in future is RTO? */
+   rto_delta_us = advancing_rto ?
+   jiffies_to_usecs(inet_csk(sk)->icsk_r

Re: [PATCH v2 net-next] tcp: allow drivers to tweak TSQ logic

2017-11-12 Thread Neal Cardwell
On Sat, Nov 11, 2017 at 6:54 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
> From: Eric Dumazet <eduma...@google.com>
>
> I had many reports that TSQ logic breaks wifi aggregation.
>
> Current logic is to allow up to 1 ms of bytes to be queued into qdisc
> and drivers queues.
>
> But Wifi aggregation needs a bigger budget to allow bigger rates to
> be discovered by various TCP Congestion Controls algorithms.
>
> This patch adds an extra socket field, allowing wifi drivers to select
> another log scale to derive TCP Small Queue credit from current pacing
> rate.
>
> Initial value is 10, meaning that this patch does not change current
> behavior.
>
> We expect wifi drivers to set this field to smaller values (tests have
> been done with values from 6 to 9)
>
> They would have to use following template :
>
> if (skb->sk && skb->sk->sk_pacing_shift != MY_PACING_SHIFT)
>  skb->sk->sk_pacing_shift = MY_PACING_SHIFT;
>
>
> Ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1670041
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Johannes Berg <johannes.b...@intel.com>
> Cc: Toke Høiland-Jørgensen <t...@toke.dk>
> Cc: Kir Kolyshkin <k...@openvz.org>
> ---

Nice. Thanks, Eric!

Acked-by: Neal Cardwell <ncardw...@google.com>

neal


Re: [PATCH net-next] tcp: tcp_mtu_probing() cleanup

2017-11-03 Thread Neal Cardwell
On Fri, Nov 3, 2017 at 9:09 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> Reduce one indentation level to make code more readable.
> tcp_sync_mss() can be factorized.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---
>  net/ipv4/tcp_timer.c |   31 ++-
>  1 file changed, 14 insertions(+), 17 deletions(-)

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net-next] tcp: tcp_fragment() should not assume rtx skbs

2017-11-02 Thread Neal Cardwell
On Thu, Nov 2, 2017 at 9:10 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> While stress testing MTU probing, we had crashes in list_del() that we 
> root-caused
> to the fact that tcp_fragment() is unconditionally inserting the freshly 
> allocated
> skb into tsorted_sent_queue list.
>
> But this list is supposed to contain skbs that were sent.
> This was mostly harmless until MTU probing was enabled.
>
> Fortunately we can use the tcp_queue enum added later (but in same linux 
> version)
> for rtx-rb-tree to fix the bug.
>
> Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK 
> recovery")
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Neal Cardwell <ncardw...@google.com>

Thanks, Eric!

neal


Re: [PATCH net] tcp: fix tcp_mtu_probe() vs highest_sack

2017-10-31 Thread Neal Cardwell
On Tue, Oct 31, 2017 at 2:08 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> Based on SNMP values provided by Roman, Yuchung made the observation
> that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
>
> Looking at tcp_mtu_probe(), I found that when a new skb was placed
> in front of the write queue, we were not updating tcp highest sack.
>
> If one skb is freed because all its content was copied to the new skb
> (for MTU probing), then tp->highest_sack could point to a now freed skb.
>
> Bad things would then happen, including infinite loops.
>
> This patch renames tcp_highest_sack_combine() and uses it
> from tcp_mtu_probe() to fix the bug.
>
> Note that I also removed one test against tp->sacked_out,
> since we want to replace tp->highest_sack regardless of whatever
> condition, since keeping a stale pointer to freed skb is a recipe
> for disaster.
>
> Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct 
> access")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Alexei Starovoitov <alexei.starovoi...@gmail.com>
> Reported-by: Roman Gushchin <g...@fb.com>
> Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
> ---
>  include/net/tcp.h |6 +++---
>  net/ipv4/tcp_output.c |3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)

Acked-by: Neal Cardwell <ncardw...@google.com>

Nice catch, indeed. Thanks, Eric!

neal


Re: Linux ECN Handling

2017-10-23 Thread Neal Cardwell
On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez  wrote:
> Hi All,
>
> I upgraded the kernel on all of our machines to Linux
> 4.13.8-041308-lowlatency. However, I'm still observing the same
> behavior where the source enters a timeout when the CWND=1MSS and it
> receives ECN marks.
>
> Here are the measured flow rates:
> 
>
> Here are snapshots of the packet traces at the sources when they both
> enter a timeout at t=1.6sec:
>
> 10.0.0.1 timeout event:
> 
>
> 10.0.0.3 timeout event:
> 
>
> Both still essentially follow the same sequence of events that I
> mentioned earlier:
> (1) receives an ACK for byte XYZ with the ECN flag set
> (2) stops sending for RTO_min=300ms
> (3) sends a retransmission for byte XYZ
>
> The cwnd samples reported by tcp_probe still indicate that the sources
> are reacting to the ECN marks more than once per window. Here are the
> cwnd samples at the same timeout event mentioned above:
> 
>
> Let me know if there is anything else you think I should try.

Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
scenario. Would you be able to add printk statements in
tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
tp->snd_ssthresh?
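
Something like this inside each of those functions would do it
(debugging sketch only, IPv4-only for simplicity, using the function's
existing tp local or tcp_sk(sk)):

	pr_info("%s: %pI4:%u cwnd=%u ssthresh=%u\n", __func__,
		&inet_sk(sk)->inet_daddr, ntohs(inet_sk(sk)->inet_dport),
		tp->snd_cwnd, tp->snd_ssthresh);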

Based on the output you may be able to figure out where cwnd is being
set to zero. If not, could you please post the printk output and
tcpdump traces (.pcap, headers-only is fine) from your tests?

thanks,
neal


Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-15 Thread Neal Cardwell
On Fri, Sep 15, 2017 at 1:03 AM, Oleksandr Natalenko
 wrote:
> Hi.
>
> I've applied your test patch but it doesn't fix the issue for me since the
> warning is still there.
>
> Were you able to reproduce it?

Hi,

Thanks for testing that. That is a very useful data point.

I was able to cook up a packetdrill test that could put the connection
in CA_Disorder with retransmitted packets out, but not in CA_Open. So
we do not yet have a test case to reproduce this.

We do not see this warning on our fleet at Google. One significant
difference I see between our environment and yours is that it seems
you run with FACK enabled:

  net.ipv4.tcp_fack = 1

Note that FACK was disabled by default (since it was replaced by RACK)
between kernel v4.10 and v4.11. And this is exactly the time when this
bug started manifesting itself for you and some others, but not our
fleet. So my new working hypothesis would be that this warning is due
to a behavior that only shows up in kernels >=4.11 when FACK is
enabled.

Would you be able to disable FACK ("sysctl net.ipv4.tcp_fack=0" at
boot, or net.ipv4.tcp_fack=0 in /etc/sysctl.conf, or equivalent),
reboot, and test the kernel for a few days to see if the warning still
pops up?

thanks,
neal

[ps: apologies for the previous, mis-formatted post...]


  1   2   3   4   >