[PATCH net] tcp: fix NULL ref in tail loss probe

2018-12-05 Thread Yuchung Cheng
The TCP loss probe timer may fire when the retransmission queue is empty but
tp->packets_out is non-zero. tcp_send_loss_probe() will then call
tcp_rearm_rto(), which triggers a NULL pointer dereference by fetching the
retransmission queue head in its sub-routines.

Add a more detailed warning to help catch the root cause of the inflight
accounting inconsistency.

Reported-by: Rafael Tinoco 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_output.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 68b5326f7321..9a1101095298 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2494,15 +2494,18 @@ void tcp_send_loss_probe(struct sock *sk)
goto rearm_timer;
}
	skb = skb_rb_last(&sk->tcp_rtx_queue);
+   if (unlikely(!skb)) {
+   WARN_ONCE(tp->packets_out,
+ "invalid inflight: %u state %u cwnd %u mss %d\n",
+ tp->packets_out, sk->sk_state, tp->snd_cwnd, mss);
+   inet_csk(sk)->icsk_pending = 0;
+   return;
+   }
 
/* At most one outstanding TLP retransmission. */
if (tp->tlp_high_seq)
goto rearm_timer;
 
-   /* Retransmit last segment. */
-   if (WARN_ON(!skb))
-   goto rearm_timer;
-
if (skb_still_in_host_queue(sk, skb))
goto rearm_timer;
 
-- 
2.20.0.rc1.387.gf8505762e3-goog



Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread Yuchung Cheng
On Wed, Dec 5, 2018 at 2:28 PM Soheil Hassas Yeganeh  wrote:
>
> On Wed, Dec 5, 2018 at 5:24 PM Eric Dumazet  wrote:
> >
> > If available rwnd is too small, tcp_tso_should_defer()
> > can decide it is worth waiting before splitting a TSO packet.
> >
> > This really means we are rwnd limited.
> >
> > Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
> > window")
> > Signed-off-by: Eric Dumazet 
>
> Acked-by: Soheil Hassas Yeganeh 
Reviewed-by: Yuchung Cheng 
>
> Excellent catch! Thank you for the fix, Eric!
>
> > ---
> >  net/ipv4/tcp_output.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 
> > 68b5326f73212ffe7111dd0f91e0a1246fb0ae25..3186902347584090256467d8679320666aa0257e
> >  100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2356,8 +2356,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
> > int mss_now, int nonagle,
> > } else {
> > if (!push_one &&
> > tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
> > -max_segs))
> > +max_segs)) {
> > +   if (!is_cwnd_limited)
> > +   is_rwnd_limited = true;
> > break;
> > +   }
> > }
> >
> > limit = mss_now;
> > --
> > 2.20.0.rc2.403.gdbc3b29805-goog
> >


Re: [PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Yuchung Cheng
On Fri, Nov 30, 2018 at 10:28 AM Sharath Chandra Vurukala
 wrote:
>
> When the tcp_retransmit_timer expires and tcp_retransmit_skb is
> called, if the retransmission fails due to local congestion,
> the backoff should not be incremented.
>
> tcp_retransmit_skb() returns a non-zero negative value in some cases of
> failure, but the caller tcp_retransmit_timer() only checks whether the
> return value is greater than zero.
> The check is corrected to check for any non-zero value.
>
> Signed-off-by: Sharath Chandra Vurukala 
Perhaps my previous comment was not clear: your bug-fix patch is incorrect.

On local congestion, tcp_retransmit_skb() returns positive values
*only*; negative values do not indicate local congestion.
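For concreteness, here is a rough sketch of the caller-side convention described
above (illustrative only, not a patch; the NET_XMIT_DROP and -errno cases are
examples of the positive/negative returns, not an exhaustive list):

/* ret > 0  : NET_XMIT_* code (e.g. NET_XMIT_DROP) propagated up from the
 *            local queueing layer -> local congestion, do not back off
 * ret == 0 : segment was handed to the IP layer
 * ret < 0  : -errno (e.g. -EBUSY, -EAGAIN, -ENOMEM) -> other failures,
 *            which the existing "> 0" test deliberately excludes
 */
int ret = tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1);

if (ret > 0) {
	/* local congestion: keep icsk_backoff unchanged, re-arm a short timer */
} else {
	/* success or a genuine failure: follow the normal backoff path */
}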

> ---
>  net/ipv4/tcp_timer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 091c5392..c19f371 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -511,7 +511,7 @@ void tcp_retransmit_timer(struct sock *sk)
>
> tcp_enter_loss(sk);
>
> -   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
> +   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) != 0) {
> /* Retransmission failed because of local congestion,
>  * do not backoff.
>  */
> --
> 1.9.1
>


[PATCH net 3/3] tcp: fix SNMP TCP timeout under-estimation

2018-11-28 Thread Yuchung Cheng
Previously the SNMP TCPTIMEOUTS counter had inconsistent accounting:
1. It counts all SYN and SYN-ACK timeouts
2. It counts timeouts in other states, except recurring timeouts and
   timeouts after fast recovery or the disorder state.

Such selective accounting makes analysis difficult and complicated. For
example, a monitoring system needs to collect many other SNMP counters
to infer the total amount of timeout events. This patch makes the
TCPTIMEOUTS counter simply count all retransmit timeouts (SYN, data, or FIN).

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_timer.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 94d858c604f6..5cd02b7b62f6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -482,11 +482,12 @@ void tcp_retransmit_timer(struct sock *sk)
goto out_reset_timer;
}
 
+   __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTS);
if (tcp_write_timeout(sk))
goto out;
 
if (icsk->icsk_retransmits == 0) {
-   int mib_idx;
+   int mib_idx = 0;
 
if (icsk->icsk_ca_state == TCP_CA_Recovery) {
if (tcp_is_sack(tp))
@@ -501,10 +502,9 @@ void tcp_retransmit_timer(struct sock *sk)
mib_idx = LINUX_MIB_TCPSACKFAILURES;
else
mib_idx = LINUX_MIB_TCPRENOFAILURES;
-   } else {
-   mib_idx = LINUX_MIB_TCPTIMEOUTS;
}
-   __NET_INC_STATS(sock_net(sk), mib_idx);
+   if (mib_idx)
+   __NET_INC_STATS(sock_net(sk), mib_idx);
}
 
tcp_enter_loss(sk);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
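
Since the changelog mentions monitoring systems: a minimal user-space sketch
(illustrative, with minimal error handling) of reading the counter this patch
makes comprehensive, by parsing the TcpExt name/value line pairs in
/proc/net/netstat and looking up the "TCPTimeouts" column:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char names[4096], values[4096];
	FILE *f = fopen("/proc/net/netstat", "r");

	if (!f)
		return 1;
	/* The file consists of pairs of lines: "TcpExt: <names...>"
	 * followed by "TcpExt: <values...>" (plus similar IpExt pairs).
	 */
	while (fgets(names, sizeof(names), f) &&
	       fgets(values, sizeof(values), f)) {
		char *np, *vp, *n, *v;

		if (strncmp(names, "TcpExt:", 7) != 0)
			continue;
		n = strtok_r(names, " \n", &np);
		v = strtok_r(values, " \n", &vp);
		while (n && v) {
			if (strcmp(n, "TCPTimeouts") == 0)
				printf("TCPTimeouts = %s\n", v);
			n = strtok_r(NULL, " \n", &np);
			v = strtok_r(NULL, " \n", &vp);
		}
	}
	fclose(f);
	return 0;
}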



[PATCH net 1/3] tcp: fix off-by-one bug on aborting window-probing socket

2018-11-28 Thread Yuchung Cheng
Previously there was an off-by-one bug in determining when to abort
a stalled window-probing socket. This patch fixes that so it is
consistent with tcp_write_timeout().

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 5f8b6d3cd855..94d858c604f6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -376,7 +376,7 @@ static void tcp_probe_timer(struct sock *sk)
return;
}
 
-   if (icsk->icsk_probes_out > max_probes) {
+   if (icsk->icsk_probes_out >= max_probes) {
 abort: tcp_write_err(sk);
} else {
/* Only send another probe if we didn't close things up. */
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH net 2/3] tcp: fix SNMP under-estimation on failed retransmission

2018-11-28 Thread Yuchung Cheng
Previously the SNMP counter LINUX_MIB_TCPRETRANSFAIL did not account
for TSO/GSO segments properly on failed retransmissions. This patch fixes that.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c5dc4c4fdadd..87bd1c61f4bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2929,7 +2929,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
} else if (err != -EBUSY) {
-   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
+   NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL, segs);
}
return err;
 }
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH net 0/3] fixes in timeout and retransmission accounting

2018-11-28 Thread Yuchung Cheng
This patch set has assorted fixes of minor accounting issues in
timeout, window probe, and retransmission stats.

Yuchung Cheng (3):
  tcp: fix off-by-one bug on aborting window-probing socket
  tcp: fix SNMP under-estimation on failed retransmission
  tcp: fix SNMP TCP timeout under-estimation

 net/ipv4/tcp_output.c |  2 +-
 net/ipv4/tcp_timer.c  | 10 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



Re: [PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-27 Thread Yuchung Cheng
On Mon, Nov 26, 2018 at 1:35 AM, Sharath Chandra Vurukala
 wrote:
> When the tcp_retransmit_timer expires and tcp_retransmit_skb is
> called, if the retransmission fails due to local congestion,
> the backoff should not be incremented.
>
> tcp_retransmit_skb() returns a non-zero negative value in some cases of
> failure, but the caller tcp_retransmit_timer() only checks whether the
> return value is greater than zero.
> The check is corrected to check for any non-zero value.
Not sure about this fix. The specific check is there to handle local
congestion, which is only indicated by positive return values.

>
> Change-Id: I494fed73b2e385216402c91e9558d5c2884add5b
> Signed-off-by: Sharath Chandra Vurukala 
> ---
>  net/ipv4/tcp_timer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 4be66e4..a70b4a9 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -536,7 +536,7 @@ void tcp_retransmit_timer(struct sock *sk)
>
> tcp_enter_loss(sk);
>
> -   if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) > 0) {
> +   if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) != 0) {
> /* Retransmission failed because of local congestion,
>  * do not backoff.
>  */
> --
> 1.9.1
>


Re: [PATCH v2 net-next 0/4] tcp: take a bit more care of backlog stress

2018-11-27 Thread Yuchung Cheng
On Tue, Nov 27, 2018 at 7:57 AM, Eric Dumazet  wrote:
> While working on the SACK compression issue Jean-Louis Dupond
> reported, we found that his linux box was suffering very hard
> from tail drops on the socket backlog queue.
>
> First patch hints the compiler about sack flows being the norm.
>
> Second patch changes non-sack code in preparation of the ack
> compression.
>
> Third patch fixes tcp_space() to take backlog into account.
>
> Fourth patch is attempting coalescing when a new packet must
> be added to the backlog queue. Cooking bigger skbs helps
> to keep backlog list smaller and speeds its handling when
> user thread finally releases the socket lock.
>
> v2: added feedback from Neal : tcp: take care of compressed acks in 
> tcp_add_reno_sack()
> added : tcp: hint compiler about sack flows
> added : tcp: make tcp_space() aware of socket backlog
Great feature!

Acked-by: Yuchung Cheng 

>
>
>
> Eric Dumazet (4):
>   tcp: hint compiler about sack flows
>   tcp: take care of compressed acks in tcp_add_reno_sack()
>   tcp: make tcp_space() aware of socket backlog
>   tcp: implement coalescing on backlog queue
>
>  include/net/tcp.h |  4 +-
>  include/uapi/linux/snmp.h |  1 +
>  net/ipv4/proc.c   |  1 +
>  net/ipv4/tcp_input.c  | 58 +++---
>  net/ipv4/tcp_ipv4.c   | 88 ---
>  5 files changed, 119 insertions(+), 33 deletions(-)
>
> --
> 2.20.0.rc0.387.gc7a69e6b6c-goog
>


Re: [PATCH iproute2] ss: add support for delivered and delivered_ce fields

2018-11-26 Thread Yuchung Cheng
On Mon, Nov 26, 2018 at 2:29 PM, Eric Dumazet  wrote:
> Kernel support was added in linux-4.18 in commit feb5f2ec6464
> ("tcp: export packets delivery info")
>
> Tested:
>
> ss -ti
> ...
> ESTAB   0 2270520  [2607:f8b0:8099:e16::]:47646   
> [2607:f8b0:8099:e18::]:38953
>  ts sack cubic wscale:8,8 rto:7 rtt:2.824/0.278 mss:1428
>  pmtu:1500 rcvmss:536 advmss:1428 cwnd:89 ssthresh:62 
> bytes_acked:2097871945
> segs_out:1469144 segs_in:65221 data_segs_out:1469142 send 360.0Mbps 
> lastsnd:2
> lastrcv:99231 lastack:2 pacing_rate 431.9Mbps delivery_rate 246.4Mbps
> (*) delivered:1469099 delivered_ce:424799
> busy:99231ms unacked:44 rcv_space:14280 rcv_ssthresh:65535
> notsent:2207688 minrtt:0.228
>
> Signed-off-by: Eric Dumazet 
Acked-by: Yuchung Cheng 

Thank you Eric!
> ---
>  misc/ss.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/misc/ss.c b/misc/ss.c
> index 
> e4d6ae489e798419fa6ce6fb0f4b8b0b3232adf6..3aa94f235085512510dca9fd597e8e37aaaf0fd3
>  100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -817,6 +817,8 @@ struct tcpstat {
> unsigned intfackets;
> unsigned intreordering;
> unsigned intnot_sent;
> +   unsigned intdelivered;
> +   unsigned intdelivered_ce;
> double  rcv_rtt;
> double  min_rtt;
> int rcv_space;
> @@ -2483,6 +2485,10 @@ static void tcp_stats_print(struct tcpstat *s)
>
> if (s->delivery_rate)
> out(" delivery_rate %sbps", sprint_bw(b1, s->delivery_rate));
> +   if (s->delivered)
> +   out(" delivered:%u", s->delivered);
> +   if (s->delivered_ce)
> +   out(" delivered_ce:%u", s->delivered_ce);
> if (s->app_limited)
> out(" app_limited");
>
> @@ -2829,6 +2835,8 @@ static void tcp_show_info(const struct nlmsghdr *nlh, 
> struct inet_diag_msg *r,
> s.busy_time = info->tcpi_busy_time;
> s.rwnd_limited = info->tcpi_rwnd_limited;
> s.sndbuf_limited = info->tcpi_sndbuf_limited;
> +   s.delivered = info->tcpi_delivered;
> +   s.delivered_ce = info->tcpi_delivered_ce;
> tcp_stats_print(&s);
> free(s.dctcp);
> free(s.bbr_info);
> --
> 2.20.0.rc0.387.gc7a69e6b6c-goog
>
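
For completeness, the same two counters can also be read straight from the
kernel with TCP_INFO; a minimal user-space sketch (assuming a linux-4.18+ UAPI
<linux/tcp.h> whose struct tcp_info already has tcpi_delivered/tcpi_delivered_ce,
per the commit referenced above):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* struct tcp_info with tcpi_delivered{,_ce} */

/* Print delivered/delivered_ce for an already-connected TCP socket fd. */
static void print_delivery_info(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	memset(&ti, 0, sizeof(ti));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
		perror("getsockopt(TCP_INFO)");
		return;
	}
	/* Older kernels copy a shorter struct; make sure the fields exist. */
	if (len >= sizeof(ti))
		printf("delivered:%u delivered_ce:%u\n",
		       ti.tcpi_delivered, ti.tcpi_delivered_ce);
}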


Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-22 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 2:40 PM, Eric Dumazet  wrote:
>
>
> On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
>
>>> +
>> Really nice! would it make sense to re-use (some of) the similar
>> tcp_try_coalesce()?
>>
>
> Maybe, but it is a bit complex, since skbs in receive queues (regular or out 
> of order)
> are accounted differently (they have skb->destructor set)
>
> Also they had the TCP header pulled already, while the backlog coalescing 
> also has
> to make sure TCP options match.
>
> Not sure if we want to add extra parameters and conditional checks...
Makes sense.

Acked-by: Yuchung Cheng 

>
>


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 4:18 PM, Eric Dumazet  wrote:
> On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet  wrote:
>> This is basically what the patch does, the while loop breaks when we have 
>> freed
>> just enough skbs.
>
> Also this is the patch we tested with Jean-Louis on his host, bring
> very nice results,
> even from an old stack sender (the one that had problems with the SACK
> compression we just fixed)
>
> Keep in mind we are dealing here with the exception, I would not spend
> too much time
> testing another variant if this one simply works.
To clarify, I do think this patch set is overall useful; I only
wanted to discuss the specifics of the head drop.

It occurs to me we check the limit differently (one w/ 64KB more), so
we may overcommit and trim more often than necessary?


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet  wrote:
>
>
> On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
>>> Under high stress, and if GRO or coalescing does not help,
>>> we better make room in backlog queue to be able to keep latest
>>> packet coming.
>>>
>>> This generally helps fast recovery, given that we often receive
>>> packets in order.
>>
>> I like the benefit of fast recovery but I am a bit leery about head
>> drop causing HoLB on large read, while tail drops can be repaired by
>> RACK and TLP already. Hmm -
>
> This is very different pattern here.
>
> We have a train of packets coming, the last packet is not a TLP probe...
>
> Consider this train coming from an old stack without burst control nor pacing.
>
> This patch guarantees last packet will be processed, and either :
>
> 1) We are a receiver, we will send a SACK. Sender will typically start 
> recovery
>
> 2) We are a sender, we will process the most recent ACK sent by the receiver.
>

Sure, on the sender side it's universally good.

On the receiver side my scenario was not the last packet being a TLP.
AFAIU the new design will first try to coalesce the incoming skb into
the tail one and then exit. Otherwise it's queued to the back with an
additional 64KB space credit. This patch checks the space w/o the
extra credit and drops the head skb. If the head skb has been
coalesced, we might end up dropping the first big chunk, which may need
a few rounds of fast recovery to repair. But I am likely to be
misunderstanding the patch :-)

Would it make sense to check the space first, before the coalesce
operation, and drop just enough bytes of the head to make room?
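
For the sake of discussion, a rough sketch of the first half of that question
(purely illustrative, not a tested patch): do the limit check before the
coalesce attempt and free whole head skbs until the incoming packet fits. It
simply reorders the loop from the patch being reviewed; trimming "just enough
bytes" of a coalesced head skb would additionally need seq/truesize bookkeeping
that is omitted here.

static void tcp_backlog_make_room(struct sock *sk, u32 limit,
				  unsigned int incoming_truesize)
{
	/* Free head skbs until the new packet fits under the un-padded limit. */
	while (sk->sk_backlog.len + incoming_truesize > limit) {
		struct sk_buff *head = sk->sk_backlog.head;

		if (!head)
			break;
		sk->sk_backlog.head = head->next;
		if (!head->next)
			sk->sk_backlog.tail = NULL;
		skb_mark_not_on_list(head);
		sk->sk_backlog.len -= head->truesize;
		kfree_skb(head);
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
	}
}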


Re: [PATCH net-next 1/3] tcp: remove hdrlen argument from tcp_queue_rcv()

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
> Only one caller needs to pull TCP headers, so lets
> move __skb_pull() to the caller side.
>
> Signed-off-by: Eric Dumazet 
> ---
Acked-by: Yuchung Cheng 

>  net/ipv4/tcp_input.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> edaaebfbcd4693ef6689f3f4c71733b3888c7c2c..e0ad7d3825b5945049c099171ce36e5c8bb9ba99
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4602,13 +4602,12 @@ static void tcp_data_queue_ofo(struct sock *sk, 
> struct sk_buff *skb)
> }
>  }
>
> -static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, 
> int hdrlen,
> - bool *fragstolen)
> +static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
> + bool *fragstolen)
>  {
> int eaten;
> struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
>
> -   __skb_pull(skb, hdrlen);
> eaten = (tail &&
>  tcp_try_coalesce(sk, tail,
>   skb, fragstolen)) ? 1 : 0;
> @@ -4659,7 +4658,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, 
> size_t size)
> TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
> TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
>
> -   if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) {
> +   if (tcp_queue_rcv(sk, skb, )) {
> WARN_ON_ONCE(fragstolen); /* should not happen */
> __kfree_skb(skb);
> }
> @@ -4719,7 +4718,7 @@ static void tcp_data_queue(struct sock *sk, struct 
> sk_buff *skb)
> goto drop;
> }
>
> -   eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
> +   eaten = tcp_queue_rcv(sk, skb, &fragstolen);
> if (skb->len)
> tcp_event_data_recv(sk, skb);
> if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
> @@ -5585,8 +5584,8 @@ void tcp_rcv_established(struct sock *sk, struct 
> sk_buff *skb)
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
>
> /* Bulk data transfer: receiver */
> -   eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
> - &fragstolen);
> +   __skb_pull(skb, tcp_header_len);
> +   eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>
> tcp_event_data_recv(sk, skb);
>
> --
> 2.19.1.1215.g8438c0b245-goog
>


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
> Under high stress, and if GRO or coalescing does not help,
> we better make room in backlog queue to be able to keep latest
> packet coming.
>
> This generally helps fast recovery, given that we often receive
> packets in order.

I like the benefit of fast recovery but I am a bit leery about head
drop causing HoLB on large read, while tail drops can be repaired by
RACK and TLP already. Hmm -

>
> Signed-off-by: Eric Dumazet 
> Tested-by: Jean-Louis Dupond 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---
>  net/ipv4/tcp_ipv4.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 
> 401e1d1cb904a4c7963d8baa419cfbf178593344..36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696
>  100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1693,6 +1693,20 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff 
> *skb)
> __skb_push(skb, hdrlen);
> }
>
> +   while (sk_rcvqueues_full(sk, limit)) {
> +   struct sk_buff *head;
> +
> +   head = sk->sk_backlog.head;
> +   if (!head)
> +   break;
> +   sk->sk_backlog.head = head->next;
> +   if (!head->next)
> +   sk->sk_backlog.tail = NULL;
> +   skb_mark_not_on_list(head);
> +   sk->sk_backlog.len -= head->truesize;
> +   kfree_skb(head);
> +   __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> +   }
> /* Only socket owner can try to collapse/prune rx queues
>  * to reduce memory overhead, so add a little headroom here.
>  * Few sockets backlog are possibly concurrently non empty.
> --
> 2.19.1.1215.g8438c0b245-goog
>


Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> softirq handler flood packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> Signed-off-by: Eric Dumazet 
> Tested-by: Jean-Louis Dupond 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---
>  include/uapi/linux/snmp.h |  1 +
>  net/ipv4/proc.c   |  1 +
>  net/ipv4/tcp_ipv4.c   | 75 +++
>  3 files changed, 71 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index 
> f80135e5feaa88609db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79
>  100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -243,6 +243,7 @@ enum
> LINUX_MIB_TCPREQQFULLDROP,  /* TCPReqQFullDrop */
> LINUX_MIB_TCPRETRANSFAIL,   /* TCPRetransFail */
> LINUX_MIB_TCPRCVCOALESCE,   /* TCPRcvCoalesce */
> +   LINUX_MIB_TCPBACKLOGCOALESCE,   /* TCPBacklogCoalesce */
> LINUX_MIB_TCPOFOQUEUE,  /* TCPOFOQueue */
> LINUX_MIB_TCPOFODROP,   /* TCPOFODrop */
> LINUX_MIB_TCPOFOMERGE,  /* TCPOFOMerge */
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 
> 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9
>  100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
> SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
> SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
> SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
> +   SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
> SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
> SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
> SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 
> 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344
>  100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
>  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
> u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
> -
> -   /* Only socket owner can try to collapse/prune rx queues
> -* to reduce memory overhead, so add a little headroom here.
> -* Few sockets backlog are possibly concurrently non empty.
> -*/
> -   limit += 64*1024;
> +   struct skb_shared_info *shinfo;
> +   const struct tcphdr *th;
> +   struct sk_buff *tail;
> +   unsigned int hdrlen;
>
> /* In case all data was pulled from skb frags (in __pskb_pull_tail()),
>  * we can fix skb->truesize to its real value to avoid future drops.
> @@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff 
> *skb)
>
> skb_dst_drop(skb);
>
> +   if (unlikely(tcp_checksum_complete(skb))) {
> +   bh_unlock_sock(sk);
> +   __TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
> +   __TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> +   return true;
> +   }
> +
> +   /* Attempt coalescing to last skb in backlog, even if we are
> +* above the limits.
> +* This is okay because skb capacity is limited to MAX_SKB_FRAGS.
> +*/
> +   th = (const struct tcphdr *)skb->data;
> +   hdrlen = th->doff * 4;
> +   shinfo = skb_shinfo(skb);
> +
> +   if (!shinfo->gso_size)
> +   shinfo->gso_size = skb->len - hdrlen;
> +
> +   if (!shinfo->gso_segs)
> +   shinfo->gso_segs = 1;
> +
> +   tail = sk->sk_backlog.tail;
> +   if (tail &&
> +   TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
> +#ifdef CONFIG_TLS_DEVICE
> +   tail->decrypted == skb->decrypted &&
> +#endif
> +   !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
> +   hdrlen - sizeof(*th))) {
> +  

Re: [PATCH net-next 3/3] tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies

2018-11-12 Thread Yuchung Cheng
On Sun, Nov 11, 2018 at 11:06 AM, Neal Cardwell  wrote:
> On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet  wrote:
>>
>> tcp_tso_should_defer() first heuristic is to not defer
>> if last send is "old enough".
>>
>> Its current implementation uses jiffies and its low granularity.
>>
>> TSO autodefer performance should not rely on kernel HZ :/
>>
>> After EDT conversion, we have state variables in nanoseconds that
>> can allow us to properly implement the heuristic.
>>
>> This patch increases TSO chunk sizes on medium rate flows,
>> especially when receivers do not use GRO or similar aggregation.
>>
>> It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP
>> behavior more uniform.
>>
>> Signed-off-by: Eric Dumazet 
>> Acked-by: Soheil Hassas Yeganeh 
>> ---
>
> Nice. Thanks!
>
> Acked-by: Neal Cardwell 
Acked-by: Yuchung Cheng 

Love it
>
> neal


[PATCH net-next] tcp: refactor DCTCP ECN ACK handling

2018-10-08 Thread Yuchung Cheng
DCTCP has two parts - a new ECN signalling mechanism and the response
function to it. The first part can be used by other congestion
controls on DCTCP-ECN deployed networks. This patch moves that part
into a separate tcp_dctcp.h so it can be used by other congestion
control modules (similar to how YeAH reuses the Vegas algorithm). For
example, BBR is currently experimenting with such an ECN signal:
https://tinyurl.com/ietf-102-iccrg-bbr2

Signed-off-by: Yuchung Cheng 
Signed-off-by: Yousuk Seung 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_dctcp.c | 55 
 net/ipv4/tcp_dctcp.h | 40 
 2 files changed, 44 insertions(+), 51 deletions(-)
 create mode 100644 net/ipv4/tcp_dctcp.h

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index ca61e2a659e7..cd4814f7e962 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include "tcp_dctcp.h"
 
 #define DCTCP_MAX_ALPHA1024U
 
@@ -118,54 +119,6 @@ static u32 dctcp_ssthresh(struct sock *sk)
return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 
2U);
 }
 
-/* Minimal DCTP CE state machine:
- *
- * S:  0 <- last pkt was non-CE
- * 1 <- last pkt was CE
- */
-
-static void dctcp_ce_state_0_to_1(struct sock *sk)
-{
-   struct dctcp *ca = inet_csk_ca(sk);
-   struct tcp_sock *tp = tcp_sk(sk);
-
-   if (!ca->ce_state) {
-   /* State has changed from CE=0 to CE=1, force an immediate
-* ACK to reflect the new CE state. If an ACK was delayed,
-* send that first to reflect the prior CE state.
-*/
-   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
-   __tcp_send_ack(sk, ca->prior_rcv_nxt);
-   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
-   }
-
-   ca->prior_rcv_nxt = tp->rcv_nxt;
-   ca->ce_state = 1;
-
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
-}
-
-static void dctcp_ce_state_1_to_0(struct sock *sk)
-{
-   struct dctcp *ca = inet_csk_ca(sk);
-   struct tcp_sock *tp = tcp_sk(sk);
-
-   if (ca->ce_state) {
-   /* State has changed from CE=1 to CE=0, force an immediate
-* ACK to reflect the new CE state. If an ACK was delayed,
-* send that first to reflect the prior CE state.
-*/
-   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
-   __tcp_send_ack(sk, ca->prior_rcv_nxt);
-   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
-   }
-
-   ca->prior_rcv_nxt = tp->rcv_nxt;
-   ca->ce_state = 0;
-
-   tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
-}
-
 static void dctcp_update_alpha(struct sock *sk, u32 flags)
 {
const struct tcp_sock *tp = tcp_sk(sk);
@@ -230,12 +183,12 @@ static void dctcp_state(struct sock *sk, u8 new_state)
 
 static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
 {
+   struct dctcp *ca = inet_csk_ca(sk);
+
switch (ev) {
case CA_EVENT_ECN_IS_CE:
-   dctcp_ce_state_0_to_1(sk);
-   break;
case CA_EVENT_ECN_NO_CE:
-   dctcp_ce_state_1_to_0(sk);
+   dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state);
break;
default:
/* Don't care for the rest. */
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
new file mode 100644
index ..d69a77cbd0c7
--- /dev/null
+++ b/net/ipv4/tcp_dctcp.h
@@ -0,0 +1,40 @@
+#ifndef _TCP_DCTCP_H
+#define _TCP_DCTCP_H
+
+static inline void dctcp_ece_ack_cwr(struct sock *sk, u32 ce_state)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if (ce_state == 1)
+   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+   else
+   tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+}
+
+/* Minimal DCTP CE state machine:
+ *
+ * S:  0 <- last pkt was non-CE
+ * 1 <- last pkt was CE
+ */
+static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
+   u32 *prior_rcv_nxt, u32 *ce_state)
+{
+   u32 new_ce_state = (evt == CA_EVENT_ECN_IS_CE) ? 1 : 0;
+
+   if (*ce_state != new_ce_state) {
+   /* CE state has changed, force an immediate ACK to
+* reflect the new CE state. If an ACK was delayed,
+* send that first to reflect the prior CE state.
+*/
+   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
+   dctcp_ece_ack_cwr(sk, *ce_state);
+   __tcp_send_ack(sk, *prior_rcv_nxt);
+   }
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+   }
+   *prior_rcv_nxt = tcp_sk(sk)->rcv_nxt;
+   *ce_state = new_ce_state;
+   dctcp_ece_ack_cwr(sk, new_ce_state);
+}
+
+#endif
-- 
2.19.0.605.g01d371f741-goog
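
As a usage illustration of the new header: a hypothetical congestion control
module (names invented here, not part of the patch) only needs to keep its own
prior_rcv_nxt/ce_state and forward both ECN CA events to the shared helper,
exactly as dctcp_cwnd_event() now does:

#include "tcp_dctcp.h"

/* Sketch of a hypothetical module reusing the DCTCP ECN signalling. */
struct myca {
	u32 prior_rcv_nxt;
	u32 ce_state;
	/* ... module-specific state ... */
};

static void myca_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
{
	struct myca *ca = inet_csk_ca(sk);

	switch (ev) {
	case CA_EVENT_ECN_IS_CE:
	case CA_EVENT_ECN_NO_CE:
		/* Shared CE state machine: forces an immediate (and possibly
		 * split) ACK whenever the CE state flips, as in DCTCP.
		 */
		dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state);
		break;
	default:
		break;
	}
}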



Re: WARN_ON in TLP causing RT throttling

2018-10-02 Thread Yuchung Cheng
On Thu, Sep 27, 2018 at 5:16 PM,  wrote:
>
> On 2018-09-27 13:14, Yuchung Cheng wrote:
>>
>> On Wed, Sep 26, 2018 at 5:09 PM, Eric Dumazet  wrote:
>>>
>>>
>>>
>>>
>>> On 09/26/2018 04:46 PM, stran...@codeaurora.org wrote:
>>> > Hi Eric,
>>> >
>>> > Someone recently reported a crash to us on the 4.14.62 kernel where 
>>> > excessive
>>> > WARNING prints were spamming the logs and causing watchdog bites. The 
>>> > kernel
>>> > does have the following commit by Soheil:
>>> > bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"
>>> >
>>> > Before this bug we see over 1 second of continuous WARN_ON prints from
>>> > tcp_send_loss_probe() like so:
>>> >
>>> > 7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
>>> > 7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
>>> > 7795.539492:   <2>  tcp_write_timer+0x4c/0x74
>>> > 7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
>>> > 7795.547113:   <2>  run_timer_softirq+0x248/0x81c
>>> >
>>> > Specifically, the prints come from the following check:
>>> >
>>> > /* Retransmit last segment. */
>>> > if (WARN_ON(!skb))
>>> > goto rearm_timer;
>>> >
>>> > Since skb is always NULL, we know there's nothing on the write queue or 
>>> > the
>>> > retransmit queue, so we just keep resetting the timer, waiting for more 
>>> > data
>>> > to be queued. However, we were able to determine that the TCP socket is 
>>> > in the
>>> > TCP_FIN_WAIT1 state, so we will no longer be sending any data and these 
>>> > queues
>>> > remain empty.
>>> >
>>> > Would it be appropriate to stop resetting the TLP timer if we detect that 
>>> > the
>>> > connection is starting to close and we have no more data to send the 
>>> > probe with,
>>> > or is there some way that this scenario should already be handled?
>>> >
>>> > Unfortunately, we don't have a reproducer for this crash.
>>> >
>>>
>>> Something is fishy.
>>>
>>> If there is no skb in the queues, then tp->packets_out should be 0,
>>> therefore tcp_rearm_rto() should simply call inet_csk_clear_xmit_timer(sk, 
>>> ICSK_TIME_RETRANS);
>>>
>>> I have never seen this report before.
>>
>> Do you use Fast Open? I am wondering if it's a bug when a TFO server
>> closes the socket before the handshake finishes...
>>
>> Either way, it's pretty safe to just stop TLP if the write queue is
>> empty for any unexpected reason.
>>
>>>
> Hi Yuchung,
>
> Based on the dumps we were able to get, it appears that TFO was not used in 
> this case.
> We also tried some local experiments where we dropped incoming SYN packets 
> after already
> successful TFO connections on the receive side to see if TFO would trigger 
> this scenario, but
> have not been able to reproduce it.
>
> One other interesting thing we found is that the socket never sent or 
> received any data. It only
> sent/received the packets for the initial handshake and the outgoing FIN.
I wonder if there's a bug in tcp_rtx_queue. The rtx queue should have at
least the FIN packet pending to be acked in TCP_FIN_WAIT1. And the
warning should only show up once if packets_out is 0, due to the check
in tcp_rearm_rto. So it seems packets_out is non-zero (i.e. counting a
FIN) but tcp_rtx_queue has become empty. Hmmm.


Re: [PATCH net-next] tcp: start receiver buffer autotuning sooner

2018-10-01 Thread Yuchung Cheng
On Mon, Oct 1, 2018 at 3:46 PM, David Miller  wrote:
> From: Yuchung Cheng 
> Date: Mon,  1 Oct 2018 15:42:32 -0700
>
>> Previously, receiver buffer auto-tuning started only after receiving
>> one advertised window amount of data. After the initial receiver
>> buffer was raised by patch a337531b942b ("tcp: up initial rmem to
>> 128KB and SYN rwin to around 64KB"), the receiver buffer may take
>> too long to start growing. To address this issue, this patch lowers
>> the initial bytes expected to receive to roughly the expected
>> sender's initial window.
>>
>> Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 
>> 64KB")
>> Signed-off-by: Yuchung Cheng 
>> Signed-off-by: Wei Wang 
>> Signed-off-by: Neal Cardwell 
>> Signed-off-by: Eric Dumazet 
>> Reviewed-by: Soheil Hassas Yeganeh 
>
> Applied, sorry for applying v1 instead of v2 the the rmem increasing patch.
> :-/
No problem, thanks for the fast response!


[PATCH net-next] tcp: start receiver buffer autotuning sooner

2018-10-01 Thread Yuchung Cheng
Previously, receiver buffer auto-tuning started only after receiving
one advertised window amount of data. After the initial receiver
buffer was raised by patch a337531b942b ("tcp: up initial rmem to
128KB and SYN rwin to around 64KB"), the receiver buffer may take
too long to start growing. To address this issue, this patch lowers
the initial bytes expected to receive to roughly the expected sender's
initial window.

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 
64KB")
Signed-off-by: Yuchung Cheng 
Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7a59f6a96212..bf1aac315490 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -438,7 +438,7 @@ void tcp_init_buffer_space(struct sock *sk)
if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
tcp_sndbuf_expand(sk);
 
-   tp->rcvq_space.space = tp->rcv_wnd;
+   tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * 
tp->advmss);
tcp_mstamp_refresh(tp);
tp->rcvq_space.time = tp->tcp_mstamp;
tp->rcvq_space.seq = tp->copied_seq;
-- 
2.19.0.605.g01d371f741-goog
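
To see what the one-liner changes in practice, a tiny stand-alone sketch of the
arithmetic (illustrative only; assumes advmss = 1460, TCP_INIT_CWND = 10, and
the ~64KB initial rcv_wnd from the rmem patch):

#include <stdio.h>

int main(void)
{
	const unsigned int tcp_init_cwnd = 10;	/* TCP_INIT_CWND */
	const unsigned int advmss = 1460;	/* assumed MSS */
	const unsigned int rcv_wnd = 64240;	/* assumed ~64KB initial rwin */
	unsigned int old_space = rcv_wnd;
	unsigned int new_space = rcv_wnd < tcp_init_cwnd * advmss ?
				 rcv_wnd : tcp_init_cwnd * advmss;

	/* DRS starts adjusting once this many bytes have been copied out. */
	printf("old rcvq_space.space = %u bytes\n", old_space);	/* 64240 */
	printf("new rcvq_space.space = %u bytes\n", new_space);	/* 14600 */
	return 0;
}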



Re: [PATCH net-next v2] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-10-01 Thread Yuchung Cheng
On Sat, Sep 29, 2018 at 11:23 AM, David Miller  wrote:
>
> From: Yuchung Cheng 
> Date: Fri, 28 Sep 2018 13:09:02 -0700
>
> > Previously the TCP initial receive buffer was ~87KB by default and
> > the initial receive window was ~29KB (20 MSS). This patch changes
> > the two numbers to 128KB and ~64KB (rounding down to a multiple
> > of MSS) respectively. The patch also simplifies the calculations s.t.
> > the two numbers are directly controlled by sysctl tcp_rmem[1]:
> >
> >   1) Initial receiver buffer budget (sk_rcvbuf): while this should
> >  be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
> >  always overrode it and set a larger size when a new connection
> >  was established.
> >
> >   2) Initial receive window in SYN: previously it was set to 20
> >  packets if MSS <= 1460. The number 20 was based on the initial
> >  congestion window of 10: the receiver needs twice that amount to
> >  avoid being limited by the receive window upon out-of-order
> >  delivery in the first window burst. But since this only
> >  applies if the receiving MSS <= 1460, connections using a large MTU
> >  (e.g. to utilize receiver zero-copy) may be limited by the
> >  receive window.
> >
> > This patch also lowers the initial bytes expected to receive in
> > the receiver buffer autotuning algorithm - otherwise the receiver
> > may take two to three rounds to increase the buffer to the
> > appropriate level (2x the sender's congestion window).
> >
> > With this patch TCP memory configuration is more straightforward and
> > more properly sized for modern high-speed networks by default. Several
> > popular stacks have been announcing a 64KB rwin in SYNs as well.
> >
> > Signed-off-by: Yuchung Cheng 
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Neal Cardwell 
> > Signed-off-by: Eric Dumazet 
> > Reviewed-by: Soheil Hassas Yeganeh 
>
> Applied, thanks.

Hi David: thanks for taking this patch - I didn't notice this earlier,
but it seems patch v1 was applied instead of v2? Should I submit a
patch with the v2-v1 diff?


[PATCH net-next v2] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-09-28 Thread Yuchung Cheng
Previously the TCP initial receive buffer was ~87KB by default and
the initial receive window was ~29KB (20 MSS). This patch changes
the two numbers to 128KB and ~64KB (rounding down to a multiple
of MSS) respectively. The patch also simplifies the calculations s.t.
the two numbers are directly controlled by sysctl tcp_rmem[1]:

  1) Initial receiver buffer budget (sk_rcvbuf): while this should
     be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
     always overrode it and set a larger size when a new connection
     was established.

  2) Initial receive window in SYN: previously it was set to 20
     packets if MSS <= 1460. The number 20 was based on the initial
     congestion window of 10: the receiver needs twice that amount to
     avoid being limited by the receive window upon out-of-order
     delivery in the first window burst. But since this only
     applies if the receiving MSS <= 1460, connections using a large MTU
     (e.g. to utilize receiver zero-copy) may be limited by the
     receive window.

This patch also lowers the initial bytes expected to receive in
the receiver buffer autotuning algorithm - otherwise the receiver
may take two to three rounds to increase the buffer to the
appropriate level (2x the sender's congestion window).

With this patch TCP memory configuration is more straightforward and
more properly sized for modern high-speed networks by default. Several
popular stacks have been announcing a 64KB rwin in SYNs as well.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp.c|  4 ++--
 net/ipv4/tcp_input.c  | 30 +-
 net/ipv4/tcp_output.c | 25 -
 3 files changed, 11 insertions(+), 48 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 69c236943f56..dcf51fbf5ec7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3896,8 +3896,8 @@ void __init tcp_init(void)
init_net.ipv4.sysctl_tcp_wmem[2] = max(64*1024, max_wshare);
 
init_net.ipv4.sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
-   init_net.ipv4.sysctl_tcp_rmem[1] = 87380;
-   init_net.ipv4.sysctl_tcp_rmem[2] = max(87380, max_rshare);
+   init_net.ipv4.sysctl_tcp_rmem[1] = 131072;
+   init_net.ipv4.sysctl_tcp_rmem[2] = max(131072, max_rshare);
 
pr_info("Hash tables configured (established %u bind %u)\n",
tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d703a0b3b6a2..4f714a031618 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -426,27 +426,9 @@ static void tcp_grow_window(struct sock *sk, const struct 
sk_buff *skb)
}
 }
 
-/* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
-{
-   u32 mss = tcp_sk(sk)->advmss;
-   int rcvmem;
-
-   rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
-tcp_default_init_rwnd(mss);
-
-   /* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
-* Allow enough cushion so that sender is not limited by our window
-*/
-   if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)
-   rcvmem <<= 2;
-
-   if (sk->sk_rcvbuf < rcvmem)
-   sk->sk_rcvbuf = min(rcvmem, 
sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
-}
-
-/* 4. Try to fixup all. It is made immediately after connection enters
- *established state.
+/* 3. Try to fixup all. It is made immediately after connection enters
+ *established state. Budget the space to the expected initial window
+ *of burst to auto-tune the receive buffer right after the first round.
  */
 void tcp_init_buffer_space(struct sock *sk)
 {
@@ -454,12 +436,10 @@ void tcp_init_buffer_space(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
int maxwin;
 
-   if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
-   tcp_fixup_rcvbuf(sk);
if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
tcp_sndbuf_expand(sk);
 
-   tp->rcvq_space.space = tp->rcv_wnd;
+   tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * 
tp->advmss);
tcp_mstamp_refresh(tp);
tp->rcvq_space.time = tp->tcp_mstamp;
tp->rcvq_space.seq = tp->copied_seq;
@@ -485,7 +465,7 @@ void tcp_init_buffer_space(struct sock *sk)
tp->snd_cwnd_stamp = tcp_jiffies32;
 }
 
-/* 5. Recalculate window clamp after socket hit its memory bounds. */
+/* 4. Recalculate window clamp after socket hit its memory bounds. */
 static void tcp_clamp_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fe7855b090e4..059b67af28b1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -195,21 +195,6 @@ static inline void tcp_even
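
As a rough sanity check of the "~64KB" rwin figure in the changelog
(illustrative only; assumes MSS = 1460 and the default tcp_adv_win_scale = 1,
under which roughly half of sk_rcvbuf is available as window budget):

#include <stdio.h>

int main(void)
{
	const unsigned int tcp_rmem_1 = 131072;	/* new default initial rcvbuf */
	const unsigned int mss = 1460;		/* assumed MSS */
	/* tcp_adv_win_scale = 1: about half the buffer is window budget. */
	unsigned int space = tcp_rmem_1 / 2;		/* 65536 */
	unsigned int init_rwnd = (space / mss) * mss;	/* 64240, i.e. ~64KB */

	printf("initial SYN rwin ~= %u bytes\n", init_rwnd);
	return 0;
}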

Re: [PATCH net-next] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-09-27 Thread Yuchung Cheng
On Thu, Sep 27, 2018 at 11:21 AM, Yuchung Cheng  wrote:
> Previously the TCP initial receive buffer was ~87KB by default and
> the initial receive window was ~29KB (20 MSS). This patch changes
> the two numbers to 128KB and ~64KB (rounding down to a multiple
> of MSS) respectively. The patch also simplifies the calculations s.t.
> the two numbers are directly controlled by sysctl tcp_rmem[1]:
>
>   1) Initial receiver buffer budget (sk_rcvbuf): while this should
>  be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
>  always overrode it and set a larger size when a new connection
>  was established.
>
>   2) Initial receive window in SYN: previously it was set to 20
>  packets if MSS <= 1460. The number 20 was based on the initial
>  congestion window of 10: the receiver needs twice that amount to
>  avoid being limited by the receive window upon out-of-order
>  delivery in the first window burst. But since this only
>  applies if the receiving MSS <= 1460, connections using a large MTU
>  (e.g. to utilize receiver zero-copy) may be limited by the
>  receive window.
>
> With this patch TCP memory configuration is more straightforward and
> more properly sized for modern high-speed networks by default. Several
> popular stacks have been announcing a 64KB rwin in SYNs as well.
Sorry, please ignore this patch for now.

We need to adjust rbuf autotuning as well; otherwise, with a larger initial
rbuf it may increase too slowly during slow start. Will submit a v2.

>
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Wei Wang 
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Eric Dumazet 
> Reviewed-by: Soheil Hassas Yeganeh 
> ---
>  net/ipv4/tcp.c|  4 ++--
>  net/ipv4/tcp_input.c  | 25 ++---
>  net/ipv4/tcp_output.c | 25 -
>  3 files changed, 8 insertions(+), 46 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 69c236943f56..dcf51fbf5ec7 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3896,8 +3896,8 @@ void __init tcp_init(void)
> init_net.ipv4.sysctl_tcp_wmem[2] = max(64*1024, max_wshare);
>
> init_net.ipv4.sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
> -   init_net.ipv4.sysctl_tcp_rmem[1] = 87380;
> -   init_net.ipv4.sysctl_tcp_rmem[2] = max(87380, max_rshare);
> +   init_net.ipv4.sysctl_tcp_rmem[1] = 131072;
> +   init_net.ipv4.sysctl_tcp_rmem[2] = max(131072, max_rshare);
>
> pr_info("Hash tables configured (established %u bind %u)\n",
> tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d703a0b3b6a2..7a59f6a96212 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -426,26 +426,7 @@ static void tcp_grow_window(struct sock *sk, const 
> struct sk_buff *skb)
> }
>  }
>
> -/* 3. Tuning rcvbuf, when connection enters established state. */
> -static void tcp_fixup_rcvbuf(struct sock *sk)
> -{
> -   u32 mss = tcp_sk(sk)->advmss;
> -   int rcvmem;
> -
> -   rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
> -tcp_default_init_rwnd(mss);
> -
> -   /* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
> -* Allow enough cushion so that sender is not limited by our window
> -*/
> -   if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)
> -   rcvmem <<= 2;
> -
> -   if (sk->sk_rcvbuf < rcvmem)
> -   sk->sk_rcvbuf = min(rcvmem, 
> sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
> -}
> -
> -/* 4. Try to fixup all. It is made immediately after connection enters
> +/* 3. Try to fixup all. It is made immediately after connection enters
>   *established state.
>   */
>  void tcp_init_buffer_space(struct sock *sk)
> @@ -454,8 +435,6 @@ void tcp_init_buffer_space(struct sock *sk)
> struct tcp_sock *tp = tcp_sk(sk);
> int maxwin;
>
> -   if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
> -   tcp_fixup_rcvbuf(sk);
> if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
> tcp_sndbuf_expand(sk);
>
> @@ -485,7 +464,7 @@ void tcp_init_buffer_space(struct sock *sk)
> tp->snd_cwnd_stamp = tcp_jiffies32;
>  }
>
> -/* 5. Recalculate window clamp after socket hit its memory bounds. */
> +/* 4. Recalculate window clamp after socket hit its memory bounds. */
>  static void tcp_clamp_window(struct sock *sk)
>  {
> struct tcp_sock *tp = tcp_sk(sk);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index fe7855b090e4..059b67af28b1 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/

Re: WARN_ON in TLP causing RT throttling

2018-09-27 Thread Yuchung Cheng
On Wed, Sep 26, 2018 at 5:09 PM, Eric Dumazet  wrote:
>
>
>
> On 09/26/2018 04:46 PM, stran...@codeaurora.org wrote:
> > Hi Eric,
> >
> > Someone recently reported a crash to us on the 4.14.62 kernel where 
> > excessive
> > WARNING prints were spamming the logs and causing watchdog bites. The kernel
> > does have the following commit by Soheil:
> > bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"
> >
> > Before this bug we see over 1 second of continuous WARN_ON prints from
> > tcp_send_loss_probe() like so:
> >
> > 7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
> > 7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
> > 7795.539492:   <2>  tcp_write_timer+0x4c/0x74
> > 7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
> > 7795.547113:   <2>  run_timer_softirq+0x248/0x81c
> >
> > Specifically, the prints come from the following check:
> >
> > /* Retransmit last segment. */
> > if (WARN_ON(!skb))
> > goto rearm_timer;
> >
> > Since skb is always NULL, we know there's nothing on the write queue or the
> > retransmit queue, so we just keep resetting the timer, waiting for more data
> > to be queued. However, we were able to determine that the TCP socket is in 
> > the
> > TCP_FIN_WAIT1 state, so we will no longer be sending any data and these 
> > queues
> > remain empty.
> >
> > Would it be appropriate to stop resetting the TLP timer if we detect that 
> > the
> > connection is starting to close and we have no more data to send the probe 
> > with,
> > or is there some way that this scenario should already be handled?
> >
> > Unfortunately, we don't have a reproducer for this crash.
> >
>
> Something is fishy.
>
> If there is no skb in the queues, then tp->packets_out should be 0,
> therefore tcp_rearm_rto() should simply call inet_csk_clear_xmit_timer(sk, 
> ICSK_TIME_RETRANS);
>
> I have never seen this report before.
Do you use Fast Open? I am wondering if it's a bug when a TFO server
closes the socket before the handshake finishes...

Either way, it's pretty safe to just stop TLP if the write queue is
empty for any unexpected reason.

>


[PATCH net-next] tcp: up initial rmem to 128KB and SYN rwin to around 64KB

2018-09-27 Thread Yuchung Cheng
Previously the TCP initial receive buffer was ~87KB by default and
the initial receive window was ~29KB (20 MSS). This patch changes
the two numbers to 128KB and ~64KB (rounding down to a multiple
of MSS) respectively. The patch also simplifies the calculations s.t.
the two numbers are directly controlled by sysctl tcp_rmem[1]:

  1) Initial receiver buffer budget (sk_rcvbuf): while this should
     be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
     always overrode it and set a larger size when a new connection
     was established.

  2) Initial receive window in SYN: previously it was set to 20
     packets if MSS <= 1460. The number 20 was based on the initial
     congestion window of 10: the receiver needs twice that amount to
     avoid being limited by the receive window upon out-of-order
     delivery in the first window burst. But since this only
     applies if the receiving MSS <= 1460, connections using a large MTU
     (e.g. to utilize receiver zero-copy) may be limited by the
     receive window.

With this patch TCP memory configuration is more straightforward and
more properly sized for modern high-speed networks by default. Several
popular stacks have been announcing a 64KB rwin in SYNs as well.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp.c|  4 ++--
 net/ipv4/tcp_input.c  | 25 ++---
 net/ipv4/tcp_output.c | 25 -
 3 files changed, 8 insertions(+), 46 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 69c236943f56..dcf51fbf5ec7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3896,8 +3896,8 @@ void __init tcp_init(void)
init_net.ipv4.sysctl_tcp_wmem[2] = max(64*1024, max_wshare);
 
init_net.ipv4.sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
-   init_net.ipv4.sysctl_tcp_rmem[1] = 87380;
-   init_net.ipv4.sysctl_tcp_rmem[2] = max(87380, max_rshare);
+   init_net.ipv4.sysctl_tcp_rmem[1] = 131072;
+   init_net.ipv4.sysctl_tcp_rmem[2] = max(131072, max_rshare);
 
pr_info("Hash tables configured (established %u bind %u)\n",
tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d703a0b3b6a2..7a59f6a96212 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -426,26 +426,7 @@ static void tcp_grow_window(struct sock *sk, const struct 
sk_buff *skb)
}
 }
 
-/* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
-{
-   u32 mss = tcp_sk(sk)->advmss;
-   int rcvmem;
-
-   rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
-tcp_default_init_rwnd(mss);
-
-   /* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
-* Allow enough cushion so that sender is not limited by our window
-*/
-   if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)
-   rcvmem <<= 2;
-
-   if (sk->sk_rcvbuf < rcvmem)
-   sk->sk_rcvbuf = min(rcvmem, 
sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
-}
-
-/* 4. Try to fixup all. It is made immediately after connection enters
+/* 3. Try to fixup all. It is made immediately after connection enters
  *established state.
  */
 void tcp_init_buffer_space(struct sock *sk)
@@ -454,8 +435,6 @@ void tcp_init_buffer_space(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
int maxwin;
 
-   if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
-   tcp_fixup_rcvbuf(sk);
if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
tcp_sndbuf_expand(sk);
 
@@ -485,7 +464,7 @@ void tcp_init_buffer_space(struct sock *sk)
tp->snd_cwnd_stamp = tcp_jiffies32;
 }
 
-/* 5. Recalculate window clamp after socket hit its memory bounds. */
+/* 4. Recalculate window clamp after socket hit its memory bounds. */
 static void tcp_clamp_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fe7855b090e4..059b67af28b1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -195,21 +195,6 @@ static inline void tcp_event_ack_sent(struct sock *sk, 
unsigned int pkts,
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
-
-u32 tcp_default_init_rwnd(u32 mss)
-{
-   /* Initial receive window should be twice of TCP_INIT_CWND to
-* enable proper sending of new unsent data during fast recovery
-* (RFC 3517, Section 4, NextSeg() rule (2)). Further place a
-* limit when mss is larger than 1460.
-*/
-   u32 init_rwnd = TCP_INIT_CWND * 2;
-
-   if (mss > 1460)
-   init_rwnd = max((1460 * init_rwnd) / mss, 2U);
-   return init_rwnd;
-}
-
 /* Determine a window scaling and initial window to offer.
  * Based on the assumption that the 

[PATCH net-next] tcp: change IPv6 flow-label upon receiving spurious retransmission

2018-08-29 Thread Yuchung Cheng
Currently a Linux IPv6 TCP sender will change the flow label upon
timeouts to potentially steer away from a data path that has gone
bad. However, this does not help if the problem is on the ACK path
and the data path is healthy. In this case the receiver is likely
to receive repeated spurious retransmissions because the sender
couldn't get the ACKs in time and has recurring timeouts.

This patch adds another feature to mitigate this problem. It
leverages the DSACK state in the receiver to change the flow
label of the ACKs to speculatively re-route the ACK packets.
In order to allow triggering on the second consecutive spurious
RTO, the receiver changes the flow label upon sending a second
consecutive DSACK for a sequence number below RCV.NXT.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c   |  2 ++
 net/ipv4/tcp_input.c | 13 +
 2 files changed, 15 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b8af2fec5ad5..8c4235c098fd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2595,6 +2595,8 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->compressed_ack = 0;
tp->bytes_sent = 0;
tp->bytes_retrans = 0;
+   tp->duplicate_sack[0].start_seq = 0;
+   tp->duplicate_sack[0].end_seq = 0;
tp->dsack_dups = 0;
tp->reord_seen = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4c2dd9f863f7..62508a2f9b21 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4199,6 +4199,17 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, 
u32 end_seq)
tcp_sack_extend(tp->duplicate_sack, seq, end_seq);
 }
 
+static void tcp_rcv_spurious_retrans(struct sock *sk, const struct sk_buff 
*skb)
+{
+   /* When the ACK path fails or drops most ACKs, the sender would
+* timeout and spuriously retransmit the same segment repeatedly.
+* The receiver remembers and reflects via DSACKs. Leverage the
+* DSACK state and change the txhash to re-route speculatively.
+*/
+   if (TCP_SKB_CB(skb)->seq == tcp_sk(sk)->duplicate_sack[0].start_seq)
+   sk_rethink_txhash(sk);
+}
+
 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
@@ -4211,6 +4222,7 @@ static void tcp_send_dupack(struct sock *sk, const struct 
sk_buff *skb)
if (tcp_is_sack(tp) && sock_net(sk)->ipv4.sysctl_tcp_dsack) {
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
+   tcp_rcv_spurious_retrans(sk, skb);
if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
end_seq = tp->rcv_nxt;
tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
@@ -4755,6 +4767,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
}
 
if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+   tcp_rcv_spurious_retrans(sk, skb);
/* A retransmit, 2nd most common case.  Force an immediate ack. 
*/
NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, 
TCP_SKB_CB(skb)->end_seq);
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog



Re: Fw: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg

2018-08-29 Thread Yuchung Cheng
On Wed, Aug 29, 2018 at 8:02 AM, Stephen Hemminger
 wrote:
>
>
>
> Begin forwarded message:
>
> Date: Sun, 26 Aug 2018 22:24:12 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200943
>
> Bug ID: 200943
>Summary: Repeating tcp_mark_head_lost in dmesg
>Product: Networking
>Version: 2.5
> Kernel Version: 4.14.66
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV4
>   Assignee: step...@networkplumber.org
>   Reporter: rm+...@romanrm.net
> Regression: No
>
> Getting a bunch of these now every hour during continuous ~100 Mbit of network
> traffic.
> What's up with that? Seems harmless, as in the kernel doesn't crash and the
> network connection is not interrupted. (Maybe the particular TCP session is?)
> If there are no ill-effects from this condition, is such spammy WARN_ON really
> necessary?
This warning is likely triggered by buggy remote SACK behaviors, and
is pretty harmless - in my opinion the tcp_verify_left_out() warning
is still worth keeping to detect other inflight state inconsistencies.

The good news is that the particular loss recovery code path is disabled
by default on 4.18+ kernels by this patch:

commit b38a51fec1c1f693f03b1aa19d0622123634d4b7
Author: Yuchung Cheng 
Date:   Wed May 16 16:40:11 2018 -0700

tcp: disable RFC6675 loss detection
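
If I recall the semantics correctly, with that commit the RFC6675 marking
path is only taken when the RACK loss-detection bit (0x1) of
net.ipv4.tcp_recovery is cleared, and that bit is set by default. A
trivial stand-alone check of the sysctl (assuming the usual procfs path):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_recovery", "r");
        int val = -1;

        if (f) {
                if (fscanf(f, "%d", &val) != 1)
                        val = -1;
                fclose(f);
        }
        printf("tcp_recovery=%d (bit 0x1 enables RACK loss detection)\n", val);
        return 0;
}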


>
> [Mon Aug 27 02:16:11 2018] [ cut here ]
> [Mon Aug 27 02:16:11 2018] WARNING: CPU: 5 PID: 0 at net/ipv4/tcp_input.c:2263
> tcp_mark_head_lost+0x247/0x260
> [Mon Aug 27 02:16:11 2018] Modules linked in: dm_snapshot loop vhost_net vhost
> tap tun ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xt_DSCP xt_mark ip6t_REJECT nf_reject_ipv6 ipt_REJECT
> nf_reject_ipv4 xt_owner xt_tcpudp xt_set ip_set_hash_net ip_set nfnetlink
> xt_limit xt_length xt_multiport xt_conntrack ip6t_rpfilter ipt_rpfilter
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw
> wireguard ip6_udp_tunnel udp_tunnel ip6table_mangle iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw
> iptable_mangle ip6table_filter ip6_tables matroxfb_base matroxfb_g450
> matroxfb_Ti3026 matroxfb_accel matroxfb_DAC1064 g450_pll matroxfb_misc
> iptable_filter ip_tables x_tables cpufreq_powersave cpufreq_userspace
> cpufreq_conservative 8021q garp mrp
> [Mon Aug 27 02:16:11 2018]  bridge stp llc bonding tcp_bbr sch_fq tcp_illinois
> fuse radeon ttm drm_kms_helper drm i2c_algo_bit it87 hwmon_vid eeepc_wmi
> asus_wmi sparse_keymap rfkill video wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm
> snd_pcm snd_timer snd soundcore joydev evdev pcspkr k10temp fam15h_power
> sp5100_tco sg shpchp wmi pcc_cpufreq acpi_cpufreq button ext4 crc16 mbcache
> jbd2 fscrypto btrfs zstd_decompress zstd_compress xxhash algif_skcipher af_alg
> dm_crypt dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod
> hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq
> async_xor async_tx xor sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0
> multipath linear md_mod vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
> ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> [Mon Aug 27 02:16:11 2018]  pcbc aesni_intel aes_x86_64 crypto_simd 
> glue_helper
> cryptd r8169 ahci xhci_pci libahci ohci_hcd ehci_pci mii xhci_hcd ehci_hcd
> i2c_piix4 libata usbcore scsi_mod bnx2
> [Mon Aug 27 02:16:11 2018] CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW
>4.14.66-rm1+ #132
> [Mon Aug 27 02:16:11 2018] Hardware name: To be filled by O.E.M. To be filled
> by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [Mon Aug 27 02:16:11 2018] task: 8ba79c679dc0 task.stack: b4d741928000
> [Mon Aug 27 02:16:11 2018] RIP: 0010:tcp_mark_head_lost+0x247/0x260
> [Mon Aug 27 02:16:11 2018] RSP: 0018:8ba7aed437d8 EFLAGS: 00010202
> [Mon Aug 27 02:16:11 2018] RAX: 0018 RBX: 8ba3901a0800 RCX:
> 
> [Mon Aug 27 02:16:11 2018] RDX: 0017 RSI: 0001 RDI:
> 8ba4d47e9000
> [Mon Aug 27 02:16:11 2018] RBP: 8ba4d47e9000 R08: 000d R09:
> 
> [Mon Aug 27 02:16:11 2018] R10: 100c R11:  R12:
> 0001
> [Mon Aug 27 02:16:11 2018] R13: 8ba4d47e9158 R14: 0001 R15:
> 9d0b6708
> [Mon Aug 27 02:16:11 2018] FS:  ()
> GS:8ba7aed400

[PATCH net-next 3/4] tcp: always ACK immediately on hole repairs

2018-08-09 Thread Yuchung Cheng
RFC 5681 sec 4.2:
  To provide feedback to senders recovering from losses, the receiver
  SHOULD send an immediate ACK when it receives a data segment that
  fills in all or part of a gap in the sequence space.

When a gap is partially filled, __tcp_ack_snd_check already checks
the out-of-order queue and correctly sends an immediate ACK. However
when a gap is fully filled, the previous implementation only resets
pingpong mode which does not guarantee an immediate ACK because the
quick ACK counter may be zero. This patch addresses this issue by
marking the one-time immediate ACK flag instead.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b8849588c440..9a09ff3afef2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4735,11 +4735,11 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
        if (!RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
tcp_ofo_queue(sk);
 
-   /* RFC2581. 4.2. SHOULD send immediate ACK, when
+   /* RFC5681. 4.2. SHOULD send immediate ACK, when
 * gap in queue is filled.
 */
                if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
-   inet_csk(sk)->icsk_ack.pingpong = 0;
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
 
if (tp->rx_opt.num_sacks)
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 4/4] tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag

2018-08-09 Thread Yuchung Cheng
Previously commit 9aee40006190 ("tcp: ack immediately when a cwr
packet arrives") called tcp_enter_quickack_mode to force sending
two immediate ACKs upon receiving a packet with the CWR flag. The side
effect is that it also resets the delayed ACK timer and interactive
session tracking. This patch removes that side effect by using the
new ACK_NOW flag to force an immediate ACK.

Packetdrill to demonstrate:

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < [ect0] SEW 0:0(0) win 32792 
   +0 > SE. 0:0(0) ack 1 
  +.1 < [ect0] . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 < [ect0] . 1:1001(1000) ack 1 win 257
   +0 > [ect01] . 1:1(0) ack 1001

   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 1:2(1) ack 1001

   +0 < [ect0] . 1001:2001(1000) ack 2 win 257
   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 2:3(1) ack 2001

   +0 < [ect0] . 2001:3001(1000) ack 3 win 257
   +0 < [ect0] . 3001:4001(1000) ack 3 win 257
   // Ack delayed ...

   +.01 < [ce] P. 4001:4501(500) ack 3 win 257
   +0 > [ect01] . 3:3(0) ack 4001
   +0 > [ect01] E. 3:3(0) ack 4501

+.001 read(4, ..., 4500) = 4500
   +0 write(4, ..., 1) = 1
   +0 > [ect01] PE. 3:4(1) ack 4501 win 100

 +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
   // No delayed ACK on CWR flag
   +0 > [ect01] . 4:4(0) ack 5501

 +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
   +0 > [ect01] . 4:4(0) ack 6501


Fixes: 9aee40006190 ("tcp: ack immediately when a cwr packet arrives")
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_input.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a09ff3afef2..4c2dd9f863f7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -245,16 +245,16 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
 }
 
-static void tcp_ecn_accept_cwr(struct tcp_sock *tp, const struct sk_buff *skb)
+static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
 {
if (tcp_hdr(skb)->cwr) {
-   tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+   tcp_sk(sk)->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
 
/* If the sender is telling us it has entered CWR, then its
 * cwnd may be very low (even just 1 packet), so we should ACK
 * immediately.
 */
-   tcp_enter_quickack_mode((struct sock *)tp, 2);
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
 }
 
@@ -4703,7 +4703,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
skb_dst_drop(skb);
__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
-   tcp_ecn_accept_cwr(tp, skb);
+   tcp_ecn_accept_cwr(sk, skb);
 
tp->rx_opt.dsack = 0;
 
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 2/4] tcp: avoid resetting ACK timer in DCTCP

2018-08-09 Thread Yuchung Cheng
The recent fix of acking immediately in DCTCP on CE status change
has an undesirable side-effect: it also resets the TCP ACK timer and
disables pingpong mode (interactive session tracking). But the CE status
change has nothing to do with them. This patch addresses that by
using the new one-time immediate ACK flag instead of calling
tcp_enter_quickack_mode().

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_dctcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 8b637f9f23a2..ca61e2a659e7 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -136,7 +136,7 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)
 */
if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
__tcp_send_ack(sk, ca->prior_rcv_nxt);
-   tcp_enter_quickack_mode(sk, 1);
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
 
ca->prior_rcv_nxt = tp->rcv_nxt;
@@ -157,7 +157,7 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
 */
if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
__tcp_send_ack(sk, ca->prior_rcv_nxt);
-   tcp_enter_quickack_mode(sk, 1);
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
 
ca->prior_rcv_nxt = tp->rcv_nxt;
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 0/4] new mechanism to ACK immediately

2018-08-09 Thread Yuchung Cheng
This patch set is a follow-up feature improvement to the recent fixes
for the performance issues in ECN (delayed) ACKs. Many of the fixes use
the tcp_enter_quickack_mode routine to force immediate ACKs. However the
routine also resets interactive session tracking. This is not ideal
because these immediate ACKs are required by protocol specifics
unrelated to the interactive nature of the application.

This patch set introduces a new flag to send a one-time immediate ACK
without changing the status of interactive session tracking. With this
patch set the immediate ACKs are generated upon these protocol states:

1) When a hole is repaired
2) When CE status changes between subsequent data packets received
3) When a data packet carries CWR flag

Yuchung Cheng (4):
  tcp: mandate a one-time immediate ACK
  tcp: avoid resetting ACK timer in DCTCP
  tcp: always ACK immediately on hole repairs
  tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag

 include/net/inet_connection_sock.h |  3 ++-
 net/ipv4/tcp_dctcp.c   |  4 ++--
 net/ipv4/tcp_input.c   | 16 +---
 3 files changed, 13 insertions(+), 10 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 1/4] tcp: mandate a one-time immediate ACK

2018-08-09 Thread Yuchung Cheng
Add a new flag to indicate a one-time immediate ACK. This flag is
occasionally set under specific TCP protocol states in addition to
the more common quickack mechanism for interactive applications.

In several cases in the TCP code we want to force an immediate ACK
but do not want to call tcp_enter_quickack_mode() because we do
not want to forget the icsk_ack.pingpong or icsk_ack.ato state.
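
For contrast, a paraphrased sketch of the helper this flag lets callers
avoid (see tcp_enter_quickack_mode() in net/ipv4/tcp_input.c for the
exact definition):

static void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
{
        struct inet_connection_sock *icsk = inet_csk(sk);

        /* Arm a burst of quick ACKs ... */
        tcp_incr_quickack(sk, max_quickacks);
        /* ... but also clobber the interactive-session and ACK-timeout
         * state that ICSK_ACK_NOW deliberately leaves untouched.
         */
        icsk->icsk_ack.pingpong = 0;
        icsk->icsk_ack.ato = TCP_ATO_MIN;
}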

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
---
 include/net/inet_connection_sock.h | 3 ++-
 net/ipv4/tcp_input.c   | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 0a6c9e0f2b5a..fa43b82607d9 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -167,7 +167,8 @@ enum inet_csk_ack_state_t {
ICSK_ACK_SCHED  = 1,
ICSK_ACK_TIMER  = 2,
ICSK_ACK_PUSHED = 4,
-   ICSK_ACK_PUSHED2 = 8
+   ICSK_ACK_PUSHED2 = 8,
+   ICSK_ACK_NOW = 16   /* Send the next ACK immediately (once) */
 };
 
 void inet_csk_init_xmit_timers(struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 715d541b52dd..b8849588c440 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5179,7 +5179,9 @@ static void __tcp_ack_snd_check(struct sock *sk, int 
ofo_possible)
(tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
 __tcp_select_window(sk) >= tp->rcv_wnd)) ||
/* We ACK each frame or... */
-   tcp_in_quickack_mode(sk)) {
+   tcp_in_quickack_mode(sk) ||
+   /* Protocol state mandates a one-time immediate ACK */
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) {
 send_now:
tcp_send_ack(sk);
return;
-- 
2.18.0.597.ga71716f1ad-goog



Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-24 Thread Yuchung Cheng
On Mon, Jul 23, 2018 at 7:23 PM, Daniel Borkmann  wrote:
>
> On 07/24/2018 04:15 AM, Neal Cardwell wrote:
> > On Mon, Jul 23, 2018 at 8:49 PM Lawrence Brakmo  wrote:
> >>
> >> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> >> problem is triggered when the last packet of a request arrives CE
> >> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> >> to 1 (because there are no packets in flight). When the 1st packet of
> >> the next request arrives, the ACK was sometimes delayed even though it
> >> is CWR marked, adding up to 40ms to the RPC latency.
> >>
> >> This patch insures that CWR marked data packets arriving will be acked
> >> immediately.
> > ...
> >> Modified based on comments by Neal Cardwell 
> >>
> >> Signed-off-by: Lawrence Brakmo 
> >> ---
> >>  net/ipv4/tcp_input.c | 9 -
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > Seems like a nice mechanism to have, IMHO.
> >
> > Acked-by: Neal Cardwell 
>
> Should this go to net tree instead where all the other fixes went?
I am neutral but this feels more like a feature improvement

Acked-by: Yuchung Cheng 

btw this should also help the classic ECN case upon timeout that
triggers one packet retransmission.
>
> Thanks,
> Daniel


[PATCH net 2/3] tcp: do not cancel delay-ACK on DCTCP special ACK

2018-07-18 Thread Yuchung Cheng
Currently when a DCTCP receiver delays an ACK and receives a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking the previous and latest sequences
respectively (for ECN accounting).

Previously, sending the first ACK could clear the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowledges
the latest sequence, which is not true for the first special ACK.

The fix is to not make that assumption in tcp_send_ack and to check
the actual ACK sequence before cancelling the delayed ACK. Further,
it is safer to pass the ACK sequence number as a local variable into
the tcp_send_ack routine, instead of intercepting tp->rcv_nxt, to
avoid future bugs like this.

Reported-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
---
 include/net/tcp.h |  1 +
 net/ipv4/tcp_dctcp.c  | 34 --
 net/ipv4/tcp_output.c | 10 +++---
 3 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3482d13d655b..a08de496d1b2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -539,6 +539,7 @@ void tcp_send_fin(struct sock *sk);
 void tcp_send_active_reset(struct sock *sk, gfp_t priority);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5869f89ca656..078328afbfe3 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -133,21 +133,8 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)
 * ACK has not sent yet.
 */
if (!ca->ce_state &&
-   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
-   u32 tmp_rcv_nxt;
-
-   /* Save current rcv_nxt. */
-   tmp_rcv_nxt = tp->rcv_nxt;
-
-   /* Generate previous ack with CE=0. */
-   tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
-   tp->rcv_nxt = ca->prior_rcv_nxt;
-
-   tcp_send_ack(sk);
-
-   /* Recover current rcv_nxt. */
-   tp->rcv_nxt = tmp_rcv_nxt;
-   }
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
+   __tcp_send_ack(sk, ca->prior_rcv_nxt);
 
ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 1;
@@ -164,21 +151,8 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
 * ACK has not sent yet.
 */
if (ca->ce_state &&
-   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
-   u32 tmp_rcv_nxt;
-
-   /* Save current rcv_nxt. */
-   tmp_rcv_nxt = tp->rcv_nxt;
-
-   /* Generate previous ack with CE=1. */
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
-   tp->rcv_nxt = ca->prior_rcv_nxt;
-
-   tcp_send_ack(sk);
-
-   /* Recover current rcv_nxt. */
-   tp->rcv_nxt = tmp_rcv_nxt;
-   }
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
+   __tcp_send_ack(sk, ca->prior_rcv_nxt);
 
ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ee1b0705321d..c4172c1fb198 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -160,7 +160,8 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
 }
 
 /* Account for an ACK we sent. */
-static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
+static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts,
+ u32 rcv_nxt)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -171,6 +172,9 @@ static inline void tcp_event_ack_sent(struct sock *sk, 
unsigned int pkts)
        if (hrtimer_try_to_cancel(&tp->compressed_ack_timer) == 1)
__sock_put(sk);
}
+
+   if (unlikely(rcv_nxt != tp->rcv_nxt))
+   return;  /* Special ACK sent by DCTCP to reflect ECN */
tcp_dec_quickack_mode(sk, pkts);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
@@ -1141,7 +1145,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb,
icsk->icsk_af_ops->send_check(sk, skb);
 
if (likely(tcb->tcp_flags & TCPHDR_ACK))
-   tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
+   tcp_event_ack_sent(sk, tcp_skb_pcount(skb), rcv_nxt);
 
if (skb->len != tcp_header_size) {
tcp_event_data_sent(tp, sk);
@@ -3613,12 +3617,12 @@ void __tcp_send_ack(struct sock *sk, u32 r

[PATCH net 3/3] tcp: do not delay ACK in DCTCP upon CE status change

2018-07-18 Thread Yuchung Cheng
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:

""" When receiving packets, the CE codepoint MUST be processed as follows:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
   true and send an immediate ACK.

   2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
   to false and send an immediate ACK.
"""

Previously the DCTCP implementation might continue to delay the ACK. This
patch fixes that and implements the RFC rule by forcing an immediate ACK.
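
A stand-alone user-space sketch (illustration only, not kernel code) of
the receiver-side rule this patch enforces - ACK immediately on every
CE<->non-CE transition:

#include <stdbool.h>
#include <stdio.h>

static bool dctcp_ce;    /* DCTCP.CE in RFC 8257 terms */

static void rx_segment(bool ce_marked)
{
        if (ce_marked != dctcp_ce) {
                dctcp_ce = ce_marked;
                printf("CE -> %d: send immediate ACK\n", ce_marked);
        } else {
                printf("CE unchanged: normal (possibly delayed) ACK\n");
        }
}

int main(void)
{
        rx_segment(false);   /* no change */
        rx_segment(true);    /* 0 -> 1: immediate ACK */
        rx_segment(true);    /* no change */
        rx_segment(false);   /* 1 -> 0: immediate ACK */
        return 0;
}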

Tested with this packetdrill script provided by Larry Brakmo

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
   +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257

+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001

+0.500 < F. 9501:9501(0) ack 4 win 257

Signed-off-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
---
 include/net/tcp.h|  1 +
 net/ipv4/tcp_dctcp.c | 30 ++
 net/ipv4/tcp_input.c |  3 ++-
 3 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a08de496d1b2..25116ec02087 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -342,6 +342,7 @@ ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
 
+void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks);
 static inline void tcp_dec_quickack_mode(struct sock *sk,
 const unsigned int pkts)
 {
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 078328afbfe3..8b637f9f23a2 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -129,12 +129,15 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)
struct dctcp *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-   /* State has changed from CE=0 to CE=1 and delayed
-* ACK has not sent yet.
-*/
-   if (!ca->ce_state &&
-   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
-   __tcp_send_ack(sk, ca->prior_rcv_nxt);
+   if (!ca->ce_state) {
+   /* State has changed from CE=0 to CE=1, force an immediate
+* ACK to reflect the new CE state. If an ACK was delayed,
+* send that first to reflect the prior CE state.
+*/
+   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
+   __tcp_send_ack(sk, ca->prior_rcv_nxt);
+   tcp_enter_quickack_mode(sk, 1);
+   }
 
ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 1;
@@ -147,12 +150,15 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
struct dctcp *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-   /* State has changed from CE=1 to CE=0 and delayed
-* ACK has not sent yet.
-*/
-   if (ca->ce_state &&
-   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
-   __tcp_send_ack(sk, ca->prior_rcv_nxt);
+   if (ca->ce_state) {
+   /* State has changed from CE=1 to CE=0, force an immediate
+* ACK to reflect the new CE state. If an ACK was delayed,
+* send that first to reflect the prior CE state.
+*/
+   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
+   __tcp_send_ack(sk, ca->prior_rcv_nxt);
+   tcp_enter_quickack_mode(sk, 1);
+   }
 
ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8e5522c6833a..6bade06aaf72 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -215,7 +215,7 @@ static void tcp_incr_quickack(struct sock *sk, unsigned int 
max_quickacks)
icsk->icsk_ack.quick = quickacks;
 }
 
-static void tcp_enter_quickack_mode(struct sock *sk, unsigned int 
max_quickacks)
+void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
 
@@ -223,6 +223,7 @@ static void tcp_enter_quickack_mode(struct sock *sk, 
unsi

[PATCH net 1/3] tcp: helpers to send special DCTCP ack

2018-07-18 Thread Yuchung Cheng
Refactor and create helpers to send the special ACK in DCTCP.

Signed-off-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
---
 net/ipv4/tcp_output.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 00e5a300ddb9..ee1b0705321d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1023,8 +1023,8 @@ static void tcp_update_skb_after_send(struct tcp_sock 
*tp, struct sk_buff *skb)
  * We are working here with either a clone of the original
  * SKB, or a fresh unique copy made by the retransmit engine.
  */
-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-   gfp_t gfp_mask)
+static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
+ int clone_it, gfp_t gfp_mask, u32 rcv_nxt)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
struct inet_sock *inet;
@@ -1100,7 +1100,7 @@ static int tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb, int clone_it,
th->source  = inet->inet_sport;
th->dest= inet->inet_dport;
th->seq = htonl(tcb->seq);
-   th->ack_seq = htonl(tp->rcv_nxt);
+   th->ack_seq = htonl(rcv_nxt);
*(((__be16 *)th) + 6)   = htons(((tcp_header_size >> 2) << 12) |
tcb->tcp_flags);
 
@@ -1178,6 +1178,13 @@ static int tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb, int clone_it,
return err;
 }
 
+static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+   gfp_t gfp_mask)
+{
+   return __tcp_transmit_skb(sk, skb, clone_it, gfp_mask,
+ tcp_sk(sk)->rcv_nxt);
+}
+
 /* This routine just queues the buffer for sending.
  *
  * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
@@ -3571,7 +3578,7 @@ void tcp_send_delayed_ack(struct sock *sk)
 }
 
 /* This routine sends an ack and also updates the window. */
-void tcp_send_ack(struct sock *sk)
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
 {
struct sk_buff *buff;
 
@@ -3604,7 +3611,12 @@ void tcp_send_ack(struct sock *sk)
skb_set_tcp_pure_ack(buff);
 
/* Send it off, this clears delayed acks for us. */
-   tcp_transmit_skb(sk, buff, 0, (__force gfp_t)0);
+   __tcp_transmit_skb(sk, buff, 0, (__force gfp_t)0, rcv_nxt);
+}
+
+void tcp_send_ack(struct sock *sk)
+{
+   __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
 }
 EXPORT_SYMBOL_GPL(tcp_send_ack);
 
-- 
2.18.0.203.gfac676dfb9-goog



[PATCH net 0/3] fix DCTCP ECE Ack series

2018-07-18 Thread Yuchung Cheng
This patch set addresses the fact that the existing DCTCP implementation
does not fully implement the ACK policy specified in the RFC. It improves
the responsiveness to CE status changes, particularly on flows with
small inflight.

Yuchung Cheng (3):
  tcp: helpers to send special ack
  tcp: do not cancel delay-ACK on DCTCP special ACK
  tcp: do not delay ACK in DCTCP upon CE status change

 include/net/tcp.h |  2 ++
 net/ipv4/tcp_dctcp.c  | 52 +--
 net/ipv4/tcp_input.c  |  3 ++-
 net/ipv4/tcp_output.c | 32 +++---
 4 files changed, 44 insertions(+), 45 deletions(-)

-- 
2.18.0.203.gfac676dfb9-goog



[PATCH net 2/2] tcp: remove DELAYED ACK events in DCTCP

2018-07-12 Thread Yuchung Cheng
After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
related callbacks are no longer needed.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
---
 include/net/tcp.h |  2 --
 net/ipv4/tcp_dctcp.c  | 25 -
 net/ipv4/tcp_output.c |  4 
 3 files changed, 31 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index af3ec72d5d41..3482d13d655b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -912,8 +912,6 @@ enum tcp_ca_event {
CA_EVENT_LOSS,  /* loss timeout */
CA_EVENT_ECN_NO_CE, /* ECT set, but not CE marked */
CA_EVENT_ECN_IS_CE, /* received CE marked IP packet */
-   CA_EVENT_DELAYED_ACK,   /* Delayed ack is sent */
-   CA_EVENT_NON_DELAYED_ACK,
 };
 
 /* Information about inbound ACK, passed to cong_ops->in_ack_event() */
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 89f88b0d8167..5869f89ca656 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -55,7 +55,6 @@ struct dctcp {
u32 dctcp_alpha;
u32 next_seq;
u32 ce_state;
-   u32 delayed_ack_reserved;
u32 loss_cwnd;
 };
 
@@ -96,7 +95,6 @@ static void dctcp_init(struct sock *sk)
 
ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
 
-   ca->delayed_ack_reserved = 0;
ca->loss_cwnd = 0;
ca->ce_state = 0;
 
@@ -250,25 +248,6 @@ static void dctcp_state(struct sock *sk, u8 new_state)
}
 }
 
-static void dctcp_update_ack_reserved(struct sock *sk, enum tcp_ca_event ev)
-{
-   struct dctcp *ca = inet_csk_ca(sk);
-
-   switch (ev) {
-   case CA_EVENT_DELAYED_ACK:
-   if (!ca->delayed_ack_reserved)
-   ca->delayed_ack_reserved = 1;
-   break;
-   case CA_EVENT_NON_DELAYED_ACK:
-   if (ca->delayed_ack_reserved)
-   ca->delayed_ack_reserved = 0;
-   break;
-   default:
-   /* Don't care for the rest. */
-   break;
-   }
-}
-
 static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
 {
switch (ev) {
@@ -278,10 +257,6 @@ static void dctcp_cwnd_event(struct sock *sk, enum 
tcp_ca_event ev)
case CA_EVENT_ECN_NO_CE:
dctcp_ce_state_1_to_0(sk);
break;
-   case CA_EVENT_DELAYED_ACK:
-   case CA_EVENT_NON_DELAYED_ACK:
-   dctcp_update_ack_reserved(sk, ev);
-   break;
default:
/* Don't care for the rest. */
break;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8e08b409c71e..00e5a300ddb9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3523,8 +3523,6 @@ void tcp_send_delayed_ack(struct sock *sk)
int ato = icsk->icsk_ack.ato;
unsigned long timeout;
 
-   tcp_ca_event(sk, CA_EVENT_DELAYED_ACK);
-
if (ato > TCP_DELACK_MIN) {
const struct tcp_sock *tp = tcp_sk(sk);
int max_ato = HZ / 2;
@@ -3581,8 +3579,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;
 
-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.
-- 
2.18.0.203.gfac676dfb9-goog



[PATCH net 1/2] tcp: fix dctcp delayed ACK schedule

2018-07-12 Thread Yuchung Cheng
Previously, when a data segment was sent, an ACK was piggybacked
on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
event to notify congestion control modules. So the DCTCP
ca->delayed_ack_reserved flag could incorrectly stay set when
in fact there were no delayed ACKs being reserved. This could result
in sending a special ECN notification ACK that carries an older
ACK sequence, when in fact there was no need for such an ACK.
DCTCP keeps track of the delayed ACK status with its own separate
state ca->delayed_ack_reserved. Previously it could accidentally cancel
the delayed ACK without updating this field upon sending a special
ACK that carries an older ACK sequence. This inconsistency would
lead to the DCTCP receiver never acknowledging the latest data until
the sender times out and retries in some cases.
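
A stand-alone sketch (illustration only, not kernel code) of the bug
class fixed here: a module-private mirror of "is a delayed ACK pending?"
drifts from the authoritative state whenever one code path forgets to
update it, which is why the fix queries the authoritative state directly:

#include <stdbool.h>
#include <stdio.h>

struct conn {
        bool delack_timer_armed;   /* authoritative: icsk_ack.pending & ICSK_ACK_TIMER */
        bool ca_delack_reserved;   /* DCTCP's private mirror of the same fact */
};

static void schedule_delayed_ack(struct conn *c)
{
        c->delack_timer_armed = true;
        c->ca_delack_reserved = true;   /* mirror updated via CA event */
}

static void send_data_with_piggybacked_ack(struct conn *c)
{
        c->delack_timer_armed = false;  /* piggybacked ACK cancels the delayed ACK... */
        /* ...but (pre-fix) no event told the CA module, so the mirror goes stale */
}

int main(void)
{
        struct conn c = { false, false };

        schedule_delayed_ack(&c);
        send_data_with_piggybacked_ack(&c);
        printf("authoritative=%d mirror=%d (stale)\n",
               c.delack_timer_armed, c.ca_delack_reserved);
        return 0;
}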

Packetdrill script (provided by Larry Brakmo)

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect01] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect01] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
// Previously the ACK sequence below would be 4501, causing a long RTO
+0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
+0 > [ect01] . 4:4(0) ack 6501 // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257

Reported-by: Larry Brakmo 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
---
 net/ipv4/tcp_dctcp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..89f88b0d8167 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -134,7 +134,8 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)
/* State has changed from CE=0 to CE=1 and delayed
 * ACK has not sent yet.
 */
-   if (!ca->ce_state && ca->delayed_ack_reserved) {
+   if (!ca->ce_state &&
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
u32 tmp_rcv_nxt;
 
/* Save current rcv_nxt. */
@@ -164,7 +165,8 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
/* State has changed from CE=1 to CE=0 and delayed
 * ACK has not sent yet.
 */
-   if (ca->ce_state && ca->delayed_ack_reserved) {
+   if (ca->ce_state &&
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
u32 tmp_rcv_nxt;
 
/* Save current rcv_nxt. */
-- 
2.18.0.203.gfac676dfb9-goog



[PATCH net 0/2] fix DCTCP delayed ACK

2018-07-12 Thread Yuchung Cheng
This patch series addresses the issue that DCTCP sometimes fails
to acknowledge the latest sequence, resulting in a sender timeout
when the inflight is small.

Yuchung Cheng (2):
  tcp: fix dctcp delayed ACK schedule
  tcp: remove DELAYED ACK events in DCTCP

 include/net/tcp.h |  2 --
 net/ipv4/tcp_dctcp.c  | 31 ---
 net/ipv4/tcp_output.c |  4 
 3 files changed, 4 insertions(+), 33 deletions(-)

-- 
2.18.0.203.gfac676dfb9-goog



Re: [PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP

2018-07-09 Thread Yuchung Cheng
On Sat, Jul 7, 2018 at 7:07 AM, Neal Cardwell  wrote:
> On Sat, Jul 7, 2018 at 7:15 AM David Miller  wrote:
>>
>> From: Lawrence Brakmo 
>> Date: Tue, 3 Jul 2018 09:26:13 -0700
>>
>> > When have observed high tail latencies when using DCTCP for RPCs as
>> > compared to using Cubic. For example, in one setup there are 2 hosts
>> > sending to a 3rd one, with each sender having 3 flows (1 stream,
>> > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
>> > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>> >
>> >Cubic 99%  Cubic 99.9%   dctcp 99%dctcp 99.9%
>> > 1MB RPCs2.6ms   5.5ms 43ms  208ms
>> > 10KB RPCs1.1ms   1.3ms 53ms  212ms
>>  ...
>> > v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in
>> > tcp_event_ack_sent. Based on Neal Cardwell 
>> > feedback.
>> > Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead of 
>> > modifying
>> > tcp_ack_send_check to insure an ACK when cwr is received.
>> > v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check.
>> >
>> > [PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent
>> > [PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet
>>
>> Neal and co., what are your thoughts right now about this patch series?
>>
>> Thank you.
>
> IMHO these patches are a definite improvement over what we have now.
>
> That said, in chatting with Yuchung before the July 4th break, I think
> Yuchung and I agreed that we would ideally like to see something like
> the following:
>
> (1) refactor the DCTCP code to check for pending delayed ACKs directly
> using existing state (inet_csk(sk)->icsk_ack.pending &
> ICSK_ACK_TIMER), and remove the ca->delayed_ack_reserved DCTCP field
> and the CA_EVENT_DELAYED_ACK and CA_EVENT_NON_DELAYED_ACK callbacks
> added for DCTCP (which Larry determined had at least one bug).
>
> (2) fix the bug with the DCTCP call to tcp_send_ack(sk) causing
> delayed ACKs to be incorrectly dropped/forgotten (not yet addressed by
> this patch series)
>
> (3) then with fixes (1) and (2) in place, re-run tests and see if we
> still need Larry's heuristic (in patch 2) to fire an ACK immediately
> if a receiver receives a CWR packet (I suspect this is still very
> useful, but I think Yuchung is reluctant to add this complexity unless
> we have verified it's still needed after (1) and (2))
>
> Our team may be able to help out with some proposed patches for (1) and (2).
>
> In any case, I would love to have Yuchung and Eric weigh in (perhaps
> Monday) before we merge this patch series.
Thanks Neal. Sorry for not relaying these comments in time before I took
off for the July 4 holidays. I was going to post the same comment - Larry:
I could provide draft patches if that helps.

>
> Thanks,
> neal


Re: [PATCH net-next] tcp: expose both send and receive intervals for rate sample

2018-07-09 Thread Yuchung Cheng
On Mon, Jul 9, 2018 at 9:05 AM, Deepti Raghavan  wrote:
> Congestion control algorithms, which access the rate sample
> through the tcp_cong_control function, only have access to the maximum
> of the send and receive interval, for cases where the acknowledgment
> rate may be inaccurate due to ACK compression or decimation. Algorithms
> may want to use send rates and receive rates as separate signals.
>
> Signed-off-by: Deepti Raghavan 
Acked-by: Yuchung Cheng 
> ---
>  include/net/tcp.h   | 2 ++
>  net/ipv4/tcp_rate.c | 4 
>  2 files changed, 6 insertions(+)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index cce3769..f6cb20e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -954,6 +954,8 @@ struct rate_sample {
>   u32  prior_delivered; /* tp->delivered at "prior_mstamp" */
>   s32  delivered; /* number of packets delivered over interval */
>   long interval_us; /* time for tp->delivered to incr "delivered" */
> + u32 snd_interval_us; /* snd interval for delivered packets */
> + u32 rcv_interval_us; /* rcv interval for delivered packets */
>   long rtt_us; /* RTT of last (S)ACKed packet (or -1) */
>   int  losses; /* number of packets marked lost upon ACK */
>   u32  acked_sacked; /* number of packets newly (S)ACKed upon ACK */
> diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
> index c61240e..4dff40d 100644
> --- a/net/ipv4/tcp_rate.c
> +++ b/net/ipv4/tcp_rate.c
> @@ -146,6 +146,10 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32
> lost,
>   rs->prior_mstamp); /* ack phase */
>   rs->interval_us = max(snd_us, ack_us);
>
> + /* Record both segment send and ack receive intervals */
> + rs->snd_interval_us = snd_us;
> + rs->rcv_interval_us = ack_us;
> +
>   /* Normally we expect interval_us >= min-rtt.
>* Note that rate may still be over-estimated when a spuriously
>* retransmistted skb was first (s)acked because "interval_us"
> --
> 2.7.4
>


Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-02 Thread Yuchung Cheng
On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
>
> DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> notifications to keep track if it needs to send an ACK for packets that
> were received with a particular ECN state but whose ACK was delayed.
>
> Under some circumstances, for example when a delayed ACK is sent with a
> data packet, DCTCP state was not being updated due to a lack of
> notification that the previously delayed ACK was sent. As a result, it
> would sometimes send a duplicate ACK when a new data packet arrived.
>
> This patch insures that DCTCP's state is correctly updated so it will
> not send the duplicate ACK.
Sorry to chime-in late here (lame excuse: IETF deadline)

IIRC this issue would exist prior to the 4.11 kernel. While it'd be good
to fix it, it's not clear which patch introduced the regression
between 4.11 and 4.16? I assume you tested Eric's most recent quickack
fix.

In terms of the fix itself, it seems odd that the tcp_send_ack() call in
DCTCP generates a NON_DELAYED_ACK event to toggle DCTCP's
delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
"prior" ACK without cancelling the delayed ACK and without mis-recording
it as cancelled, since that prior ACK really is a supplementary old ACK?

But it's still unclear how this bug introduced the regression between 4.11 and 4.16.


>
> Improved based on comments from Neal Cardwell .
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_output.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8f6129160dd..acefb64e8280 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, 
> unsigned int pkts)
> __sock_put(sk);
> }
> tcp_dec_quickack_mode(sk, pkts);
> +   if (inet_csk_ack_scheduled(sk))
> +   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>  }
>
> @@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
> if (sk->sk_state == TCP_CLOSE)
> return;
>
> -   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> -
> /* We are not putting this on the write queue, so
>  * tcp_transmit_skb() will set the ownership to this
>  * sock.
> --
> 2.17.1
>


[PATCH net] tcp: fix Fast Open key endianness

2018-06-27 Thread Yuchung Cheng
The Fast Open key could be stored in a different endianness depending on
the CPU. Previously, hosts of different endianness in a server farm using
the same key config (sysctl value) would produce different cookies.
This patch fixes it by always storing the key as little endian, keeping
the same API for little-endian hosts.
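
A stand-alone user-space illustration (not kernel code) of why converting
each 32-bit key word with cpu_to_le32() yields identical stored bytes on
big- and little-endian hosts; glibc's htole32() from <endian.h> stands in
for the kernel helper:

#include <endian.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Key words as parsed from the "%08x-%08x-%08x-%08x" sysctl string */
        uint32_t user_key[4] = { 0x01020304, 0x05060708, 0x090a0b0c, 0x0d0e0f10 };
        uint32_t key_le[4];
        const unsigned char *p = (const unsigned char *)key_le;
        int i;

        for (i = 0; i < 4; i++)
                key_le[i] = htole32(user_key[i]);   /* like cpu_to_le32() */

        /* The stored byte layout is 04 03 02 01 ... on every host, so every
         * host derives the same Fast Open cookies from the same key string.
         */
        for (i = 0; i < 16; i++)
                printf("%02x%s", p[i], (i % 4 == 3) ? " " : "");
        printf("\n");
        return 0;
}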

Reported-by: Daniele Iamartino 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/sysctl_net_ipv4.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d06247ba08b2..af0a857d8352 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -265,8 +265,9 @@ static int proc_tcp_fastopen_key(struct ctl_table *table, 
int write,
ipv4.sysctl_tcp_fastopen);
struct ctl_table tbl = { .maxlen = (TCP_FASTOPEN_KEY_LENGTH * 2 + 10) };
struct tcp_fastopen_context *ctxt;
-   int ret;
u32  user_key[4]; /* 16 bytes, matching TCP_FASTOPEN_KEY_LENGTH */
+   __le32 key[4];
+   int ret, i;
 
tbl.data = kmalloc(tbl.maxlen, GFP_KERNEL);
if (!tbl.data)
@@ -275,11 +276,14 @@ static int proc_tcp_fastopen_key(struct ctl_table *table, 
int write,
rcu_read_lock();
ctxt = rcu_dereference(net->ipv4.tcp_fastopen_ctx);
if (ctxt)
-   memcpy(user_key, ctxt->key, TCP_FASTOPEN_KEY_LENGTH);
+   memcpy(key, ctxt->key, TCP_FASTOPEN_KEY_LENGTH);
else
-   memset(user_key, 0, sizeof(user_key));
+   memset(key, 0, sizeof(key));
rcu_read_unlock();
 
+   for (i = 0; i < ARRAY_SIZE(key); i++)
+   user_key[i] = le32_to_cpu(key[i]);
+
snprintf(tbl.data, tbl.maxlen, "%08x-%08x-%08x-%08x",
user_key[0], user_key[1], user_key[2], user_key[3]);
ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
@@ -290,13 +294,17 @@ static int proc_tcp_fastopen_key(struct ctl_table *table, 
int write,
ret = -EINVAL;
goto bad_key;
}
-   tcp_fastopen_reset_cipher(net, NULL, user_key,
+
+   for (i = 0; i < ARRAY_SIZE(user_key); i++)
+   key[i] = cpu_to_le32(user_key[i]);
+
+   tcp_fastopen_reset_cipher(net, NULL, key,
  TCP_FASTOPEN_KEY_LENGTH);
}
 
 bad_key:
pr_debug("proc FO key set 0x%x-%x-%x-%x <- 0x%s: %u\n",
-  user_key[0], user_key[1], user_key[2], user_key[3],
+   user_key[0], user_key[1], user_key[2], user_key[3],
   (char *)tbl.data, ret);
kfree(tbl.data);
return ret;
-- 
2.18.0.rc2.346.g013aa6912e-goog



Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-27 Thread Yuchung Cheng
On Wed, Jun 27, 2018 at 1:00 PM, Lawrence Brakmo  wrote:
>
>
> From:  on behalf of Yuchung Cheng
> 
> Date: Wednesday, June 27, 2018 at 9:59 AM
> To: Neal Cardwell 
> Cc: Lawrence Brakmo , Matt Mathis ,
> Netdev , Kernel Team , Blake
> Matheny , Alexei Starovoitov , Eric Dumazet
> , Wei Wang 
> Subject: Re: [PATCH net-next v2] tcp: force cwnd at least 2 in
> tcp_cwnd_reduction
>
>
>
> On Wed, Jun 27, 2018 at 8:24 AM, Neal Cardwell  wrote:
>
> On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo  wrote:
>
> The only issue is if it is safe to always use 2 or if it is better to
>
> use min(2, snd_ssthresh) (which could still trigger the problem).
>
>
>
> Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
>
> that should be the same as just 2, since:
>
>
>
> (a) RFCs mandate ssthresh should not be below 2, e.g.
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_rfc5681=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=pq_Mqvzfy-C8ltkgyx1u_g=PZQC-6NGqK6QEVrf1WhlAD3mQt7tK8aqrfQGp93dJy4=WH9fXpDBHC31NCixiiMvhOb2UXCuJIxUiY4IXyJTgpc=
> page 7:
>
>
>
>   ssthresh = max (FlightSize / 2, 2*SMSS)(4)
>
>
>
> (b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
>
> that constraint, and always have an ssthresh of at least 2.
>
>
>
> And if some CC misbehaves and uses a lower ssthresh, then taking
>
> min(2, snd_ssthresh) will trigger problems, as you note.
>
>
>
> +   tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
>
>
>
> AFAICT this does seem like it will make the sender behavior more
>
> aggressive in cases with high loss and/or a very low per-flow
>
> fair-share.
>
>
>
> Old:
>
>
>
> o send N packets
>
> o receive SACKs for last 3 packets
>
> o fast retransmit packet 1
>
> o using ACKs, slow-start upward
>
>
>
> New:
>
>
>
> o send N packets
>
> o receive SACKs for last 3 packets
>
> o fast retransmit packets 1 and 2
>
> o using ACKs, slow-start upward
>
>
>
> In the extreme case, if the available fair share is less than 2
>
> packets, whereas inflight would have oscillated between 1 packet and 2
>
> packets with the existing code, it now seems like with this commit the
>
> inflight will now hover at 2. It seems like this would have
>
> significantly higher losses than we had with the existing code.
>
> I share similar concern. Note that this function is used by most
>
> existing congestion control modules beside DCTCP so I am more cautious
>
> of changing this to address DCTCP issue.
>
>
>
> Theoretically it could happen with any CC, but I have only seen it
> consistently with DCTCP.
>
> One option, as I mentioned to Neal, is to use 2 only when cwnd reduction is
> caused by
>
> ECN signal.
>
>
>
> One problem that DCTCP paper notices when cwnd = 1 is still too big
>
> when the bottleneck
>
> is shared by many flows (e.g. incast). It specifically suggest
>
> changing the lower-bound of 2 in the spec to 1. (Section 8.2).
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.usenix.org_system_files_conference_nsdi15_nsdi15-2Dpaper-2Djudd.pdf=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=pq_Mqvzfy-C8ltkgyx1u_g=PZQC-6NGqK6QEVrf1WhlAD3mQt7tK8aqrfQGp93dJy4=fO8p5khks-sGO_lwEvKicWtFz_q9LI-ecwZcS1nyZ7w=
>
>
>
> One could even envision using even lower values than 1, where we send one
> packet
>
> every 2 RTTs. However, I do not think this is a problem that needs to be
> fixed.
>
>
>
> I am curious about the differences you observe in 4.11 and 4.16. I
>
> wasn't aware of any (significant) change in tcp_cwnd_reduction / PRR
>
> algorithm between 4.11 and 4.16. Also the receiver should not delay
>
> ACKs if it has out-of-order packet or receiving CE data packets. This
>
> means the delayed ACK is by tail losses and the last data received
>
> carries no CE mark: seems a less common scenario?
>
>
>
> I have the feeling the differences between 4.11 and 4.16 are due to
> accounting
>
> changes and not change in CC behaviors.
If the difference is not due to a change in PRR, I don't think we
should change PRR to address it until we have some evidence.

Could you share some traces?

>
>
>
> The packet and ACK are not lost. I captured pcaps on both hosts and could
> see
>
> cases when the data packet arrived, but the ACK was not sent before the
> packet was
>
> retransmitted. I saw this behavior multiple times. In many cases the ACK did
> not show
>
> up in the pcap within the 40ms one would usually expect with a delayed ACK.
>
> If delayed-ACK is the problem, we probably should fix the receiver to
>
> delay ACK more intelligently, not the sender. wei...@google.com is
>
> working on it.
>
>
>
> Not sure if it is delayed-ACK or something caused the ACK to not be sent
> since it did
>
> not show up within 40ms as one would usually expect for a delayed ACK.
>
>
>
>
>
>
>
> This may or may not be OK in practice, but IMHO it is worth mentioning
>
> and discussing.
>
>
>
> neal
>
>


Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-27 Thread Yuchung Cheng
On Wed, Jun 27, 2018 at 8:24 AM, Neal Cardwell  wrote:
> On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo  wrote:
>> The only issue is if it is safe to always use 2 or if it is better to
>> use min(2, snd_ssthresh) (which could still trigger the problem).
>
> Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
> that should be the same as just 2, since:
>
> (a) RFCs mandate ssthresh should not be below 2, e.g.
> https://tools.ietf.org/html/rfc5681 page 7:
>
>  ssthresh = max (FlightSize / 2, 2*SMSS)(4)
>
> (b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
> that constraint, and always have an ssthresh of at least 2.
>
> And if some CC misbehaves and uses a lower ssthresh, then taking
> min(2, snd_ssthresh) will trigger problems, as you note.
>
>> +   tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
>
> AFAICT this does seem like it will make the sender behavior more
> aggressive in cases with high loss and/or a very low per-flow
> fair-share.
>
> Old:
>
> o send N packets
> o receive SACKs for last 3 packets
> o fast retransmit packet 1
> o using ACKs, slow-start upward
>
> New:
>
> o send N packets
> o receive SACKs for last 3 packets
> o fast retransmit packets 1 and 2
> o using ACKs, slow-start upward
>
> In the extreme case, if the available fair share is less than 2
> packets, whereas inflight would have oscillated between 1 packet and 2
> packets with the existing code, it now seems like with this commit the
> inflight will now hover at 2. It seems like this would have
> significantly higher losses than we had with the existing code.
I share a similar concern. Note that this function is used by most
existing congestion control modules besides DCTCP, so I am more cautious
about changing it to address a DCTCP issue.

One problem the DCTCP paper notes is that cwnd = 1 is still too big
when the bottleneck is shared by many flows (e.g. incast). It
specifically suggests changing the lower bound of 2 in the spec to 1
(Section 8.2).
https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf

I am curious about the differences you observe between 4.11 and 4.16. I
wasn't aware of any (significant) change in the tcp_cwnd_reduction / PRR
algorithm between 4.11 and 4.16. Also, the receiver should not delay
ACKs if it has out-of-order packets or is receiving CE-marked data
packets. This means the delayed ACK is caused by tail losses where the
last data received carries no CE mark, which seems a less common
scenario?

If delayed ACK is the problem, we probably should fix the receiver to
delay ACKs more intelligently, not the sender. wei...@google.com is
working on it.



>
> This may or may not be OK in practice, but IMHO it is worth mentioning
> and discussing.
>
> neal


Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child

2018-06-26 Thread Yuchung Cheng
On Tue, Jun 26, 2018 at 8:45 AM, Eric Dumazet  wrote:
> Signed-off-by: Eric Dumazet 
> ---
nice refactor!
Acked-by: Yuchung Cheng 

>  net/ipv4/tcp_minisocks.c | 223 ---
>  1 file changed, 113 insertions(+), 110 deletions(-)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 
> 1dda1341a223937580b4efdbedb21ae50b221ff7..dac5893a52b4520d86ed2fcadbfb561a559fcd3d
>  100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -449,119 +449,122 @@ struct sock *tcp_create_openreq_child(const struct 
> sock *sk,
>   struct sk_buff *skb)
>  {
> struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
> -
> -   if (newsk) {
> -   const struct inet_request_sock *ireq = inet_rsk(req);
> -   struct tcp_request_sock *treq = tcp_rsk(req);
> -   struct inet_connection_sock *newicsk = inet_csk(newsk);
> -   struct tcp_sock *newtp = tcp_sk(newsk);
> -   struct tcp_sock *oldtp = tcp_sk(sk);
> -
> -   smc_check_reset_syn_req(oldtp, req, newtp);
> -
> -   /* Now setup tcp_sock */
> -   newtp->pred_flags = 0;
> -
> -   newtp->rcv_wup = newtp->copied_seq =
> -   newtp->rcv_nxt = treq->rcv_isn + 1;
> -   newtp->segs_in = 1;
> -
> -   newtp->snd_sml = newtp->snd_una =
> -   newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1;
> -
> -   INIT_LIST_HEAD(&newtp->tsq_node);
> -   INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
> -
> -   tcp_init_wl(newtp, treq->rcv_isn);
> -
> -   newtp->srtt_us = 0;
> -   newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> -   minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
> -   newicsk->icsk_rto = TCP_TIMEOUT_INIT;
> -   newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
> -
> -   newtp->packets_out = 0;
> -   newtp->retrans_out = 0;
> -   newtp->sacked_out = 0;
> -   newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
> -   newtp->tlp_high_seq = 0;
> -   newtp->lsndtime = tcp_jiffies32;
> -   newsk->sk_txhash = treq->txhash;
> -   newtp->last_oow_ack_time = 0;
> -   newtp->total_retrans = req->num_retrans;
> -
> -   /* So many TCP implementations out there (incorrectly) count 
> the
> -* initial SYN frame in their delayed-ACK and congestion 
> control
> -* algorithms that we must have the following bandaid to talk
> -* efficiently to them.  -DaveM
> -*/
> -   newtp->snd_cwnd = TCP_INIT_CWND;
> -   newtp->snd_cwnd_cnt = 0;
> -
> -   /* There's a bubble in the pipe until at least the first ACK. 
> */
> -   newtp->app_limited = ~0U;
> -
> -   tcp_init_xmit_timers(newsk);
> -   newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
> -
> -   newtp->rx_opt.saw_tstamp = 0;
> -
> -   newtp->rx_opt.dsack = 0;
> -   newtp->rx_opt.num_sacks = 0;
> -
> -   newtp->urg_data = 0;
> -
> -   if (sock_flag(newsk, SOCK_KEEPOPEN))
> -   inet_csk_reset_keepalive_timer(newsk,
> -  
> keepalive_time_when(newtp));
> -
> -   newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
> -   newtp->rx_opt.sack_ok = ireq->sack_ok;
> -   newtp->window_clamp = req->rsk_window_clamp;
> -   newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> -   newtp->rcv_wnd = req->rsk_rcv_wnd;
> -   newtp->rx_opt.wscale_ok = ireq->wscale_ok;
> -   if (newtp->rx_opt.wscale_ok) {
> -   newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> -   newtp->rx_opt.rcv_wscale = ireq->rcv_wscale;
> -   } else {
> -   newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 
> 0;
> -   newtp->window_clamp = min(newtp->window_clamp, 
> 65535U);
> -   }
> -   newtp->snd_wnd = (ntohs(tcp_hdr(skb)->window) <<
> - newtp->rx_opt.snd_wscale);
> -   newtp->max_window = newtp->snd_wnd;
> -
> -   if (

Re: [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events

2018-05-22 Thread Yuchung Cheng
On Mon, May 21, 2018 at 3:08 PM, Eric Dumazet <eduma...@google.com> wrote:
> ECN signals currently forces TCP to enter quickack mode for
> up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
>
> We believe this is not needed, and only sending one immediate ack
> for the current packet should be enough.
>
> This should reduce the extra load noticed in DCTCP environments,
> after congestion events.
>
> This is part 2 of our effort to reduce pure ACK packets.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> ---
Acked-by: Yuchung Cheng <ych...@google.com>

Thanks for this patch. I am still wondering how much the "funny
extension" helps, but this patch definitely reduces the amount of
unnecessary immediate ACKs on ECN.

>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> 2e970e9f4e09d966b703af2d14d521a4328eba7e..1191cac72109f2f7e2b688ddbc1d404151d274d6
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -263,7 +263,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const 
> struct sk_buff *skb)
>  * it is probably a retransmit.
>  */
> if (tp->ecn_flags & TCP_ECN_SEEN)
> -   tcp_enter_quickack_mode((struct sock *)tp, 
> TCP_MAX_QUICKACKS);
> +   tcp_enter_quickack_mode((struct sock *)tp, 1);
> break;
> case INET_ECN_CE:
> if (tcp_ca_needs_ecn((struct sock *)tp))
> @@ -271,7 +271,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const 
> struct sk_buff *skb)
>
> if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
> /* Better not delay acks, sender can have a very low 
> cwnd */
> -   tcp_enter_quickack_mode((struct sock *)tp, 
> TCP_MAX_QUICKACKS);
> +   tcp_enter_quickack_mode((struct sock *)tp, 1);
> tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
> }
> tp->ecn_flags |= TCP_ECN_SEEN;
> --
> 2.17.0.441.gb46fe60e1d-goog
>


Re: [PATCH v3 net-next 3/6] tcp: add SACK compression

2018-05-17 Thread Yuchung Cheng
On Thu, May 17, 2018 at 2:57 PM, Neal Cardwell <ncardw...@google.com> wrote:
> On Thu, May 17, 2018 at 5:47 PM Eric Dumazet <eduma...@google.com> wrote:
>
>> When TCP receives an out-of-order packet, it immediately sends
>> a SACK packet, generating network load but also forcing the
>> receiver to send 1-MSS pathological packets, increasing its
>> RTX queue length/depth, and thus processing time.
>
>> Wifi networks suffer from this aggressive behavior, but generally
>> speaking, all these SACK packets add fuel to the fire when networks
>> are under congestion.
>
>> This patch adds a high resolution timer and tp->compressed_ack counter.
>
>> Instead of sending a SACK, we program this timer with a small delay,
>> based on RTT and capped to 1 ms :
>
>>  delay = min ( 5 % of RTT, 1 ms)
>
>> If subsequent SACKs need to be sent while the timer has not yet
>> expired, we simply increment tp->compressed_ack.
>
>> When timer expires, a SACK is sent with the latest information.
>> Whenever an ACK is sent (if data is sent, or if in-order
>> data is received) timer is canceled.
>
>> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
>> if the sack blocks need to be shuffled, even if the timer has not
>> expired.
>
>> A new SNMP counter is added in the following patch.
>
>> Two other patches add sysctls to allow changing the 1,000,000 and 44
>> values that this commit hard-coded.
>
>> Signed-off-by: Eric Dumazet <eduma...@google.com>
>> ---
>
> Very nice. I like the constants and the min(rcv_rtt, srtt).
>
> Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>

Great work. Hopefully this will save middle-boxes from handling
TCP ACKs themselves.
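
As a rough standalone illustration of the delay rule quoted above (about
5% of the RTT, capped at 1 ms), here is a sketch; the unit handling and
function name are mine, not the kernel's:

#include <stdint.h>
#include <stdio.h>

/* Sketch of the SACK compression delay: min(~5% of sRTT, 1 ms).
 * rtt_us is a smoothed RTT estimate in microseconds; result is in ns.
 */
static uint64_t sack_compress_delay_ns(uint64_t rtt_us)
{
        uint64_t delay_ns = rtt_us * 1000 / 20;    /* 5% of the RTT, in ns */
        uint64_t cap_ns = 1000 * 1000;             /* 1 ms cap */

        return delay_ns < cap_ns ? delay_ns : cap_ns;
}

int main(void)
{
        /* e.g. a 40 ms RTT path: 5% is 2 ms, so the 1 ms cap applies */
        printf("%llu ns\n", (unsigned long long)sack_compress_delay_ns(40000));
        return 0;
}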

>
> Thanks!
>
> neal


Re: [PATCH net-next 3/4] tcp: add SACK compression

2018-05-17 Thread Yuchung Cheng
On Thu, May 17, 2018 at 9:59 AM, Yuchung Cheng <ych...@google.com> wrote:
> On Thu, May 17, 2018 at 9:41 AM, Neal Cardwell <ncardw...@google.com> wrote:
>>
>> On Thu, May 17, 2018 at 11:40 AM Eric Dumazet <eric.duma...@gmail.com>
>> wrote:
>> > On 05/17/2018 08:14 AM, Neal Cardwell wrote:
>> > > Is there a particular motivation for the cap of 127? IMHO 127 ACKs is
>> quite
>> > > a few to compress. Experience seems to show that it works well to have
>> one
>> > > GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It
>> might
>> > > be nice to try to match those dynamics in this SACK compression case,
>> so it
>> > > might be nice to cap the number of compressed ACKs at something like 44?
>> > > (0x / 1448 - 1).  That way for high-speed paths we could try to keep
>> > > the ACK clock going with ACKs for ~64KBytes that trigger a single TSO
>> skb
>> > > of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.
>>
>> > 127 was chosen because the field is u8, and since skb allocation for the
>> ACK
>> > can fail, we could have cases were the field goes above 127.
>>
>> > Ultimately, I believe a followup patch would add a sysctl, so that we can
>> fine-tune
>> > this, and eventually disable ACK compression if this sysctl is set to 0
>>
>> OK, a sysctl sounds good. I would still vote for a default of 44.  :-)
>>
>>
>> > >> +   if (hrtimer_is_queued(&tp->compressed_ack_timer))
>> > >> +   return;
>> > >> +
>> > >> +   /* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
>> > >> +
>> > >> +   delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
>> > >> + tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >>
>> 3)/20);
>> > >
>> > > Any particular motivation for the 2.5ms here? It might be nice to match
>> the
>> > > existing TSO autosizing dynamics and use 1ms here instead of having a
>> > > separate new constant of 2.5ms. Smaller time scales here should lead to
>> > > less burstiness and queue pressure from data packets in the network,
>> and we
>> > > know from experience that the CPU overhead of 1ms chunks is acceptable.
>>
>> > This came from my tests on wifi really :)
>>
>> > I also had the idea to make this threshold adjustable for wifi, like we
>> did for sk_pacing_shift.
>>
>> > (On wifi, we might want to increase the max delay between ACK)
>>
>> > So maybe use 1ms delay, when sk_pacing_shift == 10, but increase it if
>> sk_pacing_shift has been lowered.
>>
>> Sounds good to me.
>>
>> Thanks for implementing this! Overall this patch seems nice to me.
>>
>> Acked-by: Neal Cardwell <ncardw...@google.com>
>>
>> BTW, I guess we should spread the word to maintainers of other major TCP
>> stacks that they need to be prepared for what may be a much higher degree
>> of compression/aggregation in the SACK stream. Linux stacks going back many
>> years should be fine with this, but I'm not sure about the other major OSes
>> (they may only allow sending one MSS per ACK-with-SACKs received).
> Patch looks really good but Neal's comment just reminds me a potential
> legacy issue.
>
> I recall at least Apple and Windows TCP stacks still need 3+ DUPACKs
> (!= a SACK covering 3+ packets) to trigger fast recovery. Will we have
> an issue there interacting w/ these stacks?
Offline chat w/ Eric: actually the problem already exists with GRO: a
Linux receiver could receive an OOO skb of, say, 5 pkts and return one
(DUP)ACK w/ a SACK option covering 5 pkts.

Since no issues have been reported, my concern is probably not a big
deal. Hopefully other stacks can improve their SACK / recovery
handling there.

>
>>
>> neal


Re: [PATCH net-next 3/4] tcp: add SACK compression

2018-05-17 Thread Yuchung Cheng
On Thu, May 17, 2018 at 9:41 AM, Neal Cardwell  wrote:
>
> On Thu, May 17, 2018 at 11:40 AM Eric Dumazet 
> wrote:
> > On 05/17/2018 08:14 AM, Neal Cardwell wrote:
> > > Is there a particular motivation for the cap of 127? IMHO 127 ACKs is
> quite
> > > a few to compress. Experience seems to show that it works well to have
> one
> > > GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It
> might
> > > be nice to try to match those dynamics in this SACK compression case,
> so it
> > > might be nice to cap the number of compressed ACKs at something like 44?
> > > (0x / 1448 - 1).  That way for high-speed paths we could try to keep
> > > the ACK clock going with ACKs for ~64KBytes that trigger a single TSO
> skb
> > > of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.
>
> > 127 was chosen because the field is u8, and since skb allocation for the
> ACK
> > can fail, we could have cases were the field goes above 127.
>
> > Ultimately, I believe a followup patch would add a sysctl, so that we can
> fine-tune
> > this, and eventually disable ACK compression if this sysctl is set to 0
>
> OK, a sysctl sounds good. I would still vote for a default of 44.  :-)
>
>
> > >> +   if (hrtimer_is_queued(&tp->compressed_ack_timer))
> > >> +   return;
> > >> +
> > >> +   /* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> > >> +
> > >> +   delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> > >> + tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >>
> 3)/20);
> > >
> > > Any particular motivation for the 2.5ms here? It might be nice to match
> the
> > > existing TSO autosizing dynamics and use 1ms here instead of having a
> > > separate new constant of 2.5ms. Smaller time scales here should lead to
> > > less burstiness and queue pressure from data packets in the network,
> and we
> > > know from experience that the CPU overhead of 1ms chunks is acceptable.
>
> > This came from my tests on wifi really :)
>
> > I also had the idea to make this threshold adjustable for wifi, like we
> did for sk_pacing_shift.
>
> > (On wifi, we might want to increase the max delay between ACK)
>
> > So maybe use 1ms delay, when sk_pacing_shift == 10, but increase it if
> sk_pacing_shift has been lowered.
>
> Sounds good to me.
>
> Thanks for implementing this! Overall this patch seems nice to me.
>
> Acked-by: Neal Cardwell 
>
> BTW, I guess we should spread the word to maintainers of other major TCP
> stacks that they need to be prepared for what may be a much higher degree
> of compression/aggregation in the SACK stream. Linux stacks going back many
> years should be fine with this, but I'm not sure about the other major OSes
> (they may only allow sending one MSS per ACK-with-SACKs received).
Patch looks really good but Neal's comment just reminds me of a potential
legacy issue.

I recall at least Apple and Windows TCP stacks still need 3+ DUPACKs
(!= a SACK covering 3+ packets) to trigger fast recovery. Will we have
an issue there interacting w/ these stacks?

>
> neal


[PATCH net-next 5/8] tcp: new helper tcp_timeout_mark_lost

2018-05-16 Thread Yuchung Cheng
Refactor using a new helper, tcp_timeout_mark_lost(), that marks packets
lost upon RTO.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 net/ipv4/tcp_input.c | 50 +---
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6fb0a28977a0..af32accda2a9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,18 +1917,43 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
-/* Enter Loss state. If we detect SACK reneging, forget all SACK information
+/* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
  */
+static void tcp_timeout_mark_lost(struct sock *sk)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct sk_buff *skb;
+   bool is_reneg;  /* is receiver reneging on SACKs? */
+
+   skb = tcp_rtx_queue_head(sk);
+   is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+   if (is_reneg) {
+   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
+   tp->sacked_out = 0;
+   /* Mark SACK reneging until we recover from this loss event. */
+   tp->is_sack_reneg = 1;
+   } else if (tcp_is_reno(tp)) {
+   tcp_reset_reno_sack(tp);
+   }
+
+   skb_rbtree_walk_from(skb) {
+   if (is_reneg)
+   TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+   tcp_mark_skb_lost(sk, skb);
+   }
+   tcp_verify_left_out(tp);
+   tcp_clear_all_retrans_hints(tp);
+}
+
+/* Enter Loss state. */
 void tcp_enter_loss(struct sock *sk)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
-   struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
-   bool is_reneg;  /* is receiver reneging on SACKs? */
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1944,24 +1969,7 @@ void tcp_enter_loss(struct sock *sk)
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-   if (tcp_is_reno(tp))
-   tcp_reset_reno_sack(tp);
-
-   skb = tcp_rtx_queue_head(sk);
-   is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
-   if (is_reneg) {
-   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
-   tp->sacked_out = 0;
-   /* Mark SACK reneging until we recover from this loss event. */
-   tp->is_sack_reneg = 1;
-   }
-   skb_rbtree_walk_from(skb) {
-   if (is_reneg)
-   TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-   tcp_mark_skb_lost(sk, skb);
-   }
-   tcp_verify_left_out(tp);
-   tcp_clear_all_retrans_hints(tp);
+   tcp_timeout_mark_lost(sk);
 
/* Timeout in disordered state after receiving substantial DUPACKs
 * suggests that the degree of reordering is over-estimated.
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 1/8] tcp: support DUPACK threshold in RACK

2018-05-16 Thread Yuchung Cheng
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.

When the number of packets SACKed is greater or equal to the
threshold, RACK sets the reordering window to zero which would
immediately mark all the unsacked packets below the highest SACKed
sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.

The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by RACK reordering timer. For example
data-center transfers where the RTT is much smaller than a timer
tick, or high RTT path where the default RTT/4 may take too long.

Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.

With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.

There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).

Also the minimum reordering window is reduced from 1 msec to 0
to recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.
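
For reference, a simplified user-space sketch of the heuristic described
above; the struct, field names, and DUPTHRESH constant are illustrative
stand-ins for the kernel state touched in the diff below:

#include <stdbool.h>
#include <stdint.h>

#define DUPTHRESH 3   /* classic DUPACK threshold (tp->reordering default) */

struct rack_state {
        bool     reord_seen;     /* reordering ever observed on this flow */
        bool     in_recovery;    /* already in fast recovery */
        uint32_t sacked_out;     /* packets currently SACKed */
        uint32_t min_rtt_us;     /* path min RTT */
        uint32_t srtt_us;        /* smoothed RTT */
        uint32_t reo_wnd_steps;  /* grows on DSACKs, starts at 1 */
};

/* Reordering window in microseconds, per the heuristic described above. */
static uint32_t rack_reo_wnd(const struct rack_state *r)
{
        if (!r->reord_seen) {
                /* No reordering seen: be aggressive, collapse the window. */
                if (r->in_recovery || r->sacked_out >= DUPTHRESH)
                        return 0;
        }
        /* Otherwise allow min_rtt/4, scaled by steps, capped at srtt. */
        uint32_t wnd = (r->min_rtt_us / 4) * r->reo_wnd_steps;

        return wnd < r->srtt_us ? wnd : r->srtt_us;
}

int main(void)
{
        struct rack_state r = { .sacked_out = 3, .min_rtt_us = 20000,
                                .srtt_us = 25000, .reo_wnd_steps = 1 };

        /* three SACKed packets, no reordering seen: window collapses to 0 */
        return rack_reo_wnd(&r) == 0 ? 0 : 1;
}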

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 Documentation/networking/ip-sysctl.txt |  1 +
 include/net/tcp.h  |  1 +
 net/ipv4/tcp_recovery.c| 40 +-
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 59afc9a10b4f..13bbac50dc8b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -451,6 +451,7 @@ tcp_recovery - INTEGER
RACK: 0x1 enables the RACK loss detection for fast detection of lost
  retransmissions and tail drops.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+   RACK: 0x4 disables RACK's DUPACK threshold heuristic
 
Default: 0x1
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3b1d617b0110..85000c85ddcd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -245,6 +245,7 @@ extern long sysctl_tcp_mem[3];
 
 #define TCP_RACK_LOSS_DETECTION  0x1 /* Use RACK to detect losses */
 #define TCP_RACK_STATIC_REO_WND  0x2 /* Use static RACK reo wnd */
+#define TCP_RACK_NO_DUPTHRESH0x4 /* Do not use DUPACK threshold in RACK */
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 3a81720ac0c4..1c1bdf12a96f 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -21,6 +21,32 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, 
u32 seq2)
return t1 > t2 || (t1 == t2 && after(seq1, seq2));
 }
 
+u32 tcp_rack_reo_wnd(const struct sock *sk)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if (!tp->rack.reord) {
+   /* If reordering has not been observed, be aggressive during
+* the recovery or starting the recovery by DUPACK threshold.
+*/
+   if (inet_csk(sk)->icsk_ca_state >= TCP_CA_Recovery)
+   return 0;
+
+   if (tp->sacked_out >= tp->reordering &&
+   !(sock_net(sk)->ipv4.sysctl_tcp_recovery & 
TCP_RACK_NO_DUPTHRESH))
+   return 0;
+   }
+
+   /* To be more reordering resilient, allow min_rtt/4 settling delay.
+* Use min_rtt instead of the smoothed RTT because reordering is
+* often a path property and less related to queuing or delayed ACKs.
+* Upon receiving DSACKs, linearly increase the window up to the
+* smoothed RTT.
+*/
+   return min((tcp_min_rtt(tp) >> 2) * tp->rack.reo_wnd_steps,
+  tp->srtt_us >> 3

[PATCH net-next 4/8] tcp: account lost retransmit after timeout

2018-05-16 Thread Yuchung Cheng
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).

The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.

This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.
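
Conceptually, the per-packet marking keeps the aggregate counters
consistent as it goes instead of rebuilding them, which is what fixes the
lost-retransmit accounting. A simplified sketch, with illustrative types
rather than the kernel's:

#include <stdbool.h>

struct skb_acct {
        unsigned int pcount;    /* packets represented by this skb */
        bool         lost;      /* already marked lost */
        bool         retrans;   /* a retransmission is still in flight */
};

struct flow_acct {
        unsigned int lost_out;
        unsigned int retrans_out;
};

/* Mark one skb lost while keeping the aggregate counters consistent,
 * instead of zeroing lost_out/retrans_out and rebuilding them.
 */
void mark_skb_lost(struct flow_acct *tp, struct skb_acct *skb)
{
        if (skb->retrans) {
                skb->retrans = false;
                tp->retrans_out -= skb->pcount;  /* lost rexmit is accounted */
        }
        if (!skb->lost) {
                skb->lost = true;
                tp->lost_out += skb->pcount;
        }
}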

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 include/net/tcp.h   |  1 +
 net/ipv4/tcp_input.c| 18 +++---
 net/ipv4/tcp_recovery.c |  4 ++--
 3 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d7f81325bee5..402484ed9b57 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 076206873e3e..6fb0a28977a0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,7 +1929,6 @@ void tcp_enter_loss(struct sock *sk)
struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
-   bool mark_lost;
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1945,9 +1944,6 @@ void tcp_enter_loss(struct sock *sk)
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-   tp->retrans_out = 0;
-   tp->lost_out = 0;
-
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp);
 
@@ -1959,21 +1955,13 @@ void tcp_enter_loss(struct sock *sk)
/* Mark SACK reneging until we recover from this loss event. */
tp->is_sack_reneg = 1;
}
-   tcp_clear_all_retrans_hints(tp);
-
skb_rbtree_walk_from(skb) {
-   mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
-is_reneg);
-   if (mark_lost)
-   tcp_sum_lost(tp, skb);
-   TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-   if (mark_lost) {
+   if (is_reneg)
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-   TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
-   tp->lost_out += tcp_skb_pcount(skb);
-   }
+   tcp_mark_skb_lost(sk, skb);
}
tcp_verify_left_out(tp);
+   tcp_clear_all_retrans_hints(tp);
 
/* Timeout in disordered state after receiving substantial DUPACKs
 * suggests that the degree of reordering is over-estimated.
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 299b0e38aa9a..b2f9be388bf3 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -2,7 +2,7 @@
 #include 
 #include 
 
-static void tcp_rack_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -95,7 +95,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 
*reo_timeout)
remaining = tp->rack.rtt_us + reo_wnd -
tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
if (remaining <= 0) {
-   tcp_rack_mark_skb_lost(sk, skb);
+   tcp_mark_skb_lost(sk, skb);
list_del_init(&skb->tcp_tsorted_anchor);
} else {
/* Record maximum wait time */
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 8/8] tcp: don't mark recently sent packets lost on RTO

2018-05-16 Thread Yuchung Cheng
An RTO event indicates the head has not been acked for a long time
after its last (re)transmission. But the other packets are not
necessarily lost if they have only been sent recently (for example
due to application limit). This patch prohibits marking packets
sent within an RTT as lost on an RTO event, using logic similar to
TCP RACK loss detection.

Normally the head (SND.UNA) would be marked lost since RTO should
fire strictly after the head was sent. An exception is when the
most recent RACK RTT measurement is larger than the (previous)
RTO. To address this exception the head is always marked lost.

Congestion control interaction: since we may not mark every packet
lost, the congestion window may be more than 1 (inflight plus 1).
But only one packet will be retransmitted after RTO, since
tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
connection still performs slow start from one packet (with Cubic
congestion control).
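
A simplified sketch of that rule, using an array in place of the kernel's
retransmit rbtree and illustrative names: packets sent less than one RACK
RTT ago are skipped, and the head is always marked.

#include <stdbool.h>
#include <stdint.h>

struct pkt {
        uint64_t sent_us;   /* time of the last (re)transmission */
        bool     lost;
};

/* On RTO, mark packets lost but skip those sent less than one RACK RTT
 * ago; the head (index 0) is always marked.
 */
void rto_mark_lost(struct pkt *pkts, int n, uint64_t now_us, uint64_t rack_rtt_us)
{
        for (int i = 0; i < n; i++) {
                bool recently_sent = now_us - pkts[i].sent_us < rack_rtt_us;

                if (i != 0 && recently_sent)
                        continue;   /* don't mark recently sent ones lost yet */
                pkts[i].lost = true;
        }
}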

This commit was tested in an A/B test with Google web servers,
and showed a reduction of 2% in (spurious) retransmits post
timeout (SlowStartRetrans), and correspondingly reduced DSACKs
(DSACKIgnoredOld) by 7%.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 net/ipv4/tcp_input.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ba8a8e3464aa..0bf032839548 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,11 +1929,11 @@ static bool tcp_is_rack(const struct sock *sk)
 static void tcp_timeout_mark_lost(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct sk_buff *skb;
+   struct sk_buff *skb, *head;
bool is_reneg;  /* is receiver reneging on SACKs? */
 
-   skb = tcp_rtx_queue_head(sk);
-   is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+   head = tcp_rtx_queue_head(sk);
+   is_reneg = head && (TCP_SKB_CB(head)->sacked & TCPCB_SACKED_ACKED);
if (is_reneg) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
tp->sacked_out = 0;
@@ -1943,9 +1943,13 @@ static void tcp_timeout_mark_lost(struct sock *sk)
tcp_reset_reno_sack(tp);
}
 
+   skb = head;
skb_rbtree_walk_from(skb) {
if (is_reneg)
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+   else if (tcp_is_rack(sk) && skb != head &&
+tcp_rack_skb_timeout(tp, skb, 0) > 0)
+   continue; /* Don't mark recently sent ones lost yet */
tcp_mark_skb_lost(sk, skb);
}
tcp_verify_left_out(tp);
@@ -1972,7 +1976,7 @@ void tcp_enter_loss(struct sock *sk)
tcp_ca_event(sk, CA_EVENT_LOSS);
tcp_init_undo(tp);
}
-   tp->snd_cwnd   = 1;
+   tp->snd_cwnd   = tcp_packets_in_flight(tp) + 1;
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 7/8] tcp: new helper tcp_rack_skb_timeout

2018-05-16 Thread Yuchung Cheng
Create and export a new helper, tcp_rack_skb_timeout(), and move tcp_is_rack()
to prepare for the final RTO change.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 include/net/tcp.h   |  2 ++
 net/ipv4/tcp_input.c| 10 +-
 net/ipv4/tcp_recovery.c |  9 +++--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 402484ed9b57..b46d0f9adbdb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1880,6 +1880,8 @@ void tcp_init(void);
 /* tcp_recovery.c */
 void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
+extern s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb,
+   u32 reo_wnd);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 u64 xmit_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1ccc97b368c7..ba8a8e3464aa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,6 +1917,11 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
+static bool tcp_is_rack(const struct sock *sk)
+{
+   return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
@@ -2031,11 +2036,6 @@ static inline int tcp_dupack_heuristics(const struct 
tcp_sock *tp)
return tp->sacked_out + 1;
 }
 
-static bool tcp_is_rack(const struct sock *sk)
-{
-   return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
-}
-
 /* Linux NewReno/SACK/ECN state machine.
  * --
  *
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index b2f9be388bf3..30cbfb69b1de 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -47,6 +47,12 @@ u32 tcp_rack_reo_wnd(const struct sock *sk)
   tp->srtt_us >> 3);
 }
 
+s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, u32 reo_wnd)
+{
+   return tp->rack.rtt_us + reo_wnd -
+  tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+}
+
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
  *
  * Marks a packet lost, if some packet sent later has been (s)acked.
@@ -92,8 +98,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 
*reo_timeout)
/* A packet is lost if it has not been s/acked beyond
 * the recent RTT plus the reordering window.
 */
-   remaining = tp->rack.rtt_us + reo_wnd -
-   tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+   remaining = tcp_rack_skb_timeout(tp, skb, reo_wnd);
if (remaining <= 0) {
tcp_mark_skb_lost(sk, skb);
list_del_init(&skb->tcp_tsorted_anchor);
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 6/8] tcp: separate loss marking and state update on RTO

2018-05-16 Thread Yuchung Cheng
Previously when TCP times out, it first updates cwnd and ssthresh,
marks packets lost, and then updates congestion state again. This
was fine because everything not yet delivered is marked lost,
so the inflight is always 0 and cwnd can be safely set to 1 to
retransmit one packet on timeout.

But the inflight may not always be 0 on timeout if TCP changes to
mark packets lost based on packet sent time. Therefore we must
first mark the packet lost, then set the cwnd based on the
(updated) inflight.

This is not a pure refactor. Congestion control may potentially
break if it uses (not yet updated) inflight to compute ssthresh.
Fortunately no existing congestion control module does that.
Also it changes the inflight at the time CA_LOSS_EVENT is called; only
westwood processes such an event, and it does not use inflight.

This change has two other minor side benefits:
1) consistent with Fast Recovery s.t. the inflight is updated
   first before tcp_enter_recovery flips state to CA_Recovery.

2) avoid intertwining loss marking with state update, making the
   code more readable.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af32accda2a9..1ccc97b368c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1955,6 +1955,8 @@ void tcp_enter_loss(struct sock *sk)
struct net *net = sock_net(sk);
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
 
+   tcp_timeout_mark_lost(sk);
+
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
!after(tp->high_seq, tp->snd_una) ||
@@ -1969,8 +1971,6 @@ void tcp_enter_loss(struct sock *sk)
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-   tcp_timeout_mark_lost(sk);
-
/* Timeout in disordered state after receiving substantial DUPACKs
 * suggests that the degree of reordering is over-estimated.
 */
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 2/8] tcp: disable RFC6675 loss detection

2018-05-16 Thread Yuchung Cheng
This patch disables RFC6675 loss detection and makes the sysctl
net.ipv4.tcp_recovery a binary choice between RACK (1) and
RFC6675 (0).

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 Documentation/networking/ip-sysctl.txt |  3 ++-
 net/ipv4/tcp_input.c   | 12 
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 13bbac50dc8b..ea304a23c8d7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -449,7 +449,8 @@ tcp_recovery - INTEGER
features.
 
RACK: 0x1 enables the RACK loss detection for fast detection of lost
- retransmissions and tail drops.
+ retransmissions and tail drops. It also subsumes and disables
+ RFC6675 recovery for SACK connections.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
RACK: 0x4 disables RACK's DUPACK threshold heuristic
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d75edd..ccbe04f80040 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2035,6 +2035,11 @@ static inline int tcp_dupack_heuristics(const struct 
tcp_sock *tp)
return tp->sacked_out + 1;
 }
 
+static bool tcp_is_rack(const struct sock *sk)
+{
+   return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* Linux NewReno/SACK/ECN state machine.
  * --
  *
@@ -2141,7 +2146,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
return true;
 
/* Not-A-Trick#2 : Classic rule... */
-   if (tcp_dupack_heuristics(tp) > tp->reordering)
+   if (!tcp_is_rack(sk) && tcp_dupack_heuristics(tp) > tp->reordering)
return true;
 
return false;
@@ -2722,8 +2727,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int 
*ack_flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   /* Use RACK to detect loss */
-   if (sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
+   if (tcp_is_rack(sk)) {
u32 prior_retrans = tp->retrans_out;
 
tcp_rack_mark_lost(sk);
@@ -2862,7 +2866,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const 
u32 prior_snd_una,
fast_rexmit = 1;
}
 
-   if (do_lost)
+   if (!tcp_is_rack(sk) && do_lost)
tcp_update_scoreboard(sk, fast_rexmit);
*rexmit = REXMIT_LOST;
 }
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 3/8] tcp: simpler NewReno implementation

2018-05-16 Thread Yuchung Cheng
This is a rewrite of the NewReno loss recovery implementation that is
simpler and standalone, for readability and better performance by
using fewer states.

Note that NewReno refers to RFC6582 as a modification to the fast
recovery algorithm. It is used only if the connection does not
support SACK in Linux. It should not be confused with the Reno
(AIMD) congestion control.
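
A minimal sketch of the rewritten trigger: the head is marked lost on three
or more DUPACKs before recovery, or on an ACK that advances snd_una during
recovery. Names are illustrative; the kernel version also fragments and
tags the actual head skb.

#include <stdbool.h>

#define DUPTHRESH 3

/* NewReno trigger for non-SACK flows: mark the head lost when either
 *  a) not yet in recovery and at least DUPTHRESH duplicate ACKs arrived, or
 *  b) already in recovery and this ACK advanced snd_una.
 */
bool newreno_should_mark_head_lost(bool in_recovery, unsigned int dupacks,
                                   bool snd_una_advanced)
{
        if (!in_recovery)
                return dupacks >= DUPTHRESH;
        return snd_una_advanced;
}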

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
---
 include/net/tcp.h   |  1 +
 net/ipv4/tcp_input.c| 19 +++
 net/ipv4/tcp_recovery.c | 27 +++
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85000c85ddcd..d7f81325bee5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 u64 xmit_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ccbe04f80040..076206873e3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2223,9 +2223,7 @@ static void tcp_update_scoreboard(struct sock *sk, int 
fast_rexmit)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_is_reno(tp)) {
-   tcp_mark_head_lost(sk, 1, 1);
-   } else {
+   if (tcp_is_sack(tp)) {
int sacked_upto = tp->sacked_out - tp->reordering;
if (sacked_upto >= 0)
tcp_mark_head_lost(sk, sacked_upto, 0);
@@ -2723,11 +2721,16 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 
prior_snd_una)
return false;
 }
 
-static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
+static void tcp_identify_packet_loss(struct sock *sk, int *ack_flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_is_rack(sk)) {
+   if (tcp_rtx_queue_empty(sk))
+   return;
+
+   if (unlikely(tcp_is_reno(tp))) {
+   tcp_newreno_mark_lost(sk, *ack_flag & FLAG_SND_UNA_ADVANCED);
+   } else if (tcp_is_rack(sk)) {
u32 prior_retrans = tp->retrans_out;
 
tcp_rack_mark_lost(sk);
@@ -2823,11 +2826,11 @@ static void tcp_fastretrans_alert(struct sock *sk, 
const u32 prior_snd_una,
tcp_try_keep_open(sk);
return;
}
-   tcp_rack_identify_loss(sk, ack_flag);
+   tcp_identify_packet_loss(sk, ack_flag);
break;
case TCP_CA_Loss:
tcp_process_loss(sk, flag, is_dupack, rexmit);
-   tcp_rack_identify_loss(sk, ack_flag);
+   tcp_identify_packet_loss(sk, ack_flag);
if (!(icsk->icsk_ca_state == TCP_CA_Open ||
  (*ack_flag & FLAG_LOST_RETRANS)))
return;
@@ -2844,7 +2847,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const 
u32 prior_snd_una,
if (icsk->icsk_ca_state <= TCP_CA_Disorder)
tcp_try_undo_dsack(sk);
 
-   tcp_rack_identify_loss(sk, ack_flag);
+   tcp_identify_packet_loss(sk, ack_flag);
if (!tcp_time_to_recover(sk, flag)) {
tcp_try_to_open(sk, flag);
return;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 1c1bdf12a96f..299b0e38aa9a 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -216,3 +216,30 @@ void tcp_rack_update_reo_wnd(struct sock *sk, struct 
rate_sample *rs)
tp->rack.reo_wnd_steps = 1;
}
 }
+
+/* RFC6582 NewReno recovery for non-SACK connection. It simply retransmits
+ * the next unacked packet upon receiving
+ * a) three or more DUPACKs to start the fast recovery
+ * b) an ACK acknowledging new data during the fast recovery.
+ */
+void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced)
+{
+   const u8 state = inet_csk(sk)->icsk_ca_state;
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if ((state < TCP_CA_Recovery && tp->sacked_out >= tp->reordering) ||
+   (state == TCP_CA_Recovery && snd_una_advanced)) {
+   struct sk_buff *skb = tcp_rtx_queue_head(sk);
+   u32 mss;
+
+   if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
+   return;
+
+   mss = tcp_skb_mss(skb);
+   if (tcp_skb_pcount(skb) > 1 && skb->len > mss)
+   tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, 

[PATCH net-next 0/8] tcp: default RACK loss recovery

2018-05-16 Thread Yuchung Cheng
This patch set implements the features corresponding to the
draft-ietf-tcpm-rack-03 version of the RACK draft.
https://datatracker.ietf.org/meeting/101/materials/slides-101-tcpm-update-on-tcp-rack-00

1. SACK: implement equivalent DUPACK threshold heuristic in RACK to
   replace existing RFC6675 recovery (tcp_mark_head_lost).

2. Non-SACK: simplify RFC6582 NewReno implementation

3. RTO: apply RACK's time-based approach to avoid spuriously
   marking very recently sent packets lost.

4. with (1)(2)(3), make RACK the exclusive fast recovery mechanism to
   mark losses based on time on S/ACK. Tail loss probe and F-RTO remain
   enabled by default as complementary mechanisms to send probes in
   CA_Open and CA_Loss states. The probes would solicit S/ACKs to trigger
   RACK time-based loss detection.

All Google web and internal servers have been running RACK-only mode
(4) for a while now. A/B experiments indicate RACK/TLP on average
reduces recovery latency by 10% compared to RFC6675. RFC6675
is now off by default but can be re-enabled by disabling RACK (sysctl
net.ipv4.tcp_recovery=0) in case of unforeseen issues.

Yuchung Cheng (8):
  tcp: support DUPACK threshold in RACK
  tcp: disable RFC6675 loss detection
  tcp: simpler NewReno implementation
  tcp: account lost retransmit after timeout
  tcp: new helper tcp_timeout_mark_lost
  tcp: separate loss marking and state update on RTO
  tcp: new helper tcp_rack_skb_timeout
  tcp: don't mark recently sent packets lost on RTO

 Documentation/networking/ip-sysctl.txt |  4 +-
 include/net/tcp.h  |  5 ++
 net/ipv4/tcp_input.c   | 99 ++
 net/ipv4/tcp_recovery.c| 80 -
 4 files changed, 124 insertions(+), 64 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net] tcp: ignore Fast Open on repair mode

2018-04-25 Thread Yuchung Cheng
The TCP repair sequence of operations is to first set the socket in
repair mode, then inject the TCP stats into the socket with repair
socket options, then call connect() to re-activate the socket. The
connect syscall simply returns and sets the state to ESTABLISHED.
As a result Fast Open is meaningless for TCP repair.
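
For context, a rough user-space sketch of that repair sequence; error
handling and the actual state injection are elided, and the helper name
is made up:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19
#endif

/* Sketch of the repair sequence: enter repair mode, restore state via the
 * repair socket options (elided), then connect(), which returns right away
 * with the socket in ESTABLISHED state.
 */
int repair_connect(int fd, const struct sockaddr_in *peer)
{
        int on = 1, off = 0;

        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
                return -1;
        /* ... TCP_REPAIR_QUEUE / TCP_QUEUE_SEQ / TCP_REPAIR_OPTIONS here ... */
        if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0)
                return -1;
        return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
}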

However, allowing a sendto() system call with the MSG_FASTOPEN flag
halfway through the repair operation could unexpectedly cause data to be
sent before the operation finishes changing the internal TCP stats
(e.g. MSS). This in turn triggers TCP warnings on inconsistent
packet accounting.

The fix is to simply disallow Fast Open operation once the socket
is in the repair mode.

Reported-by: syzbot <syzkal...@googlegroups.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9ce1c726185e..4b18ad41d4df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1204,7 +1204,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
uarg->zerocopy = 0;
}
 
-   if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect)) {
+   if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
+   !tp->repair) {
err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size);
if (err == -EINPROGRESS && copied_syn > 0)
goto out;
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH net-next 0/4] tracking TCP data delivery and ECN stats

2018-04-19 Thread Yuchung Cheng
On Thu, Apr 19, 2018 at 10:07 AM, David Miller <da...@davemloft.net> wrote:
>
> From: Yuchung Cheng <ych...@google.com>
> Date: Tue, 17 Apr 2018 23:18:45 -0700
>
> > This patch series improve tracking the data delivery status
> >   1. minor improvement on SYN data
> >   2. accounting bytes delivered with CE marks
> >   3. exporting the delivery stats to applications
> >
> > s.t. users can get better sense of TCP performance at per host,
> > per connection, and even per application message level.
>
> Definitely useful, so series applied.
Thanks.

The TCP socket is getting bigger and bigger :-( I am cooking a patch set
to simplify loss recovery that should help conserve space.

>
> But it is not lost upon me that slowly over time tcp sockets are
> bloating quite a bit...


[PATCH net-next 4/4] tcp: export packets delivery info

2018-04-18 Thread Yuchung Cheng
Export data delivered and delivered with CE marks to
1) SNMP TCPDelivered and TCPDeliveredCE
2) getsockopt(TCP_INFO)
3) Timestamping API SOF_TIMESTAMPING_OPT_STATS

Note that for SCM_TSTAMP_ACK, the delivery info in
SOF_TIMESTAMPING_OPT_STATS is reported before the info
has been fully updated by the ACK.

These stats help applications monitor TCP delivery and ECN status
at the per-host, per-connection, and even per-message level.
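
A small user-space sketch of consuming the new counters via
getsockopt(TCP_INFO); it assumes uapi headers that already carry the
tcpi_delivered / tcpi_delivered_ce fields added by this patch, and omits
error handling:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>    /* struct tcp_info with the new fields */

/* Print the new delivery counters for a connected TCP socket fd. */
void print_delivery_stats(int fd)
{
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        memset(&ti, 0, sizeof(ti));
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
                printf("delivered=%u delivered_ce=%u\n",
                       ti.tcpi_delivered, ti.tcpi_delivered_ce);
}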

Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 include/uapi/linux/snmp.h | 2 ++
 include/uapi/linux/tcp.h  | 5 +
 net/ipv4/proc.c   | 2 ++
 net/ipv4/tcp.c| 7 ++-
 net/ipv4/tcp_input.c  | 6 +-
 5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 33a70ece462f..d02e859301ff 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -276,6 +276,8 @@ enum
LINUX_MIB_TCPKEEPALIVE, /* TCPKeepAlive */
LINUX_MIB_TCPMTUPFAIL,  /* TCPMTUPFail */
LINUX_MIB_TCPMTUPSUCCESS,   /* TCPMTUPSuccess */
+   LINUX_MIB_TCPDELIVERED, /* TCPDelivered */
+   LINUX_MIB_TCPDELIVEREDCE,   /* TCPDeliveredCE */
__LINUX_MIB_MAX
 };
 
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 560374c978f9..379b08700a54 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -224,6 +224,9 @@ struct tcp_info {
__u64   tcpi_busy_time;  /* Time (usec) busy sending data */
__u64   tcpi_rwnd_limited;   /* Time (usec) limited by receive window */
__u64   tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
+
+   __u32   tcpi_delivered;
+   __u32   tcpi_delivered_ce;
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -244,6 +247,8 @@ enum {
TCP_NLA_SNDQ_SIZE,  /* Data (bytes) pending in send queue */
TCP_NLA_CA_STATE,   /* ca_state of socket */
TCP_NLA_SND_SSTHRESH,   /* Slow start size threshold */
+   TCP_NLA_DELIVERED,  /* Data pkts delivered incl. out-of-order */
+   TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
 
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index a058de677e94..261b71d0ccc5 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -296,6 +296,8 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPKeepAlive", LINUX_MIB_TCPKEEPALIVE),
SNMP_MIB_ITEM("TCPMTUPFail", LINUX_MIB_TCPMTUPFAIL),
SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
+   SNMP_MIB_ITEM("TCPDelivered", LINUX_MIB_TCPDELIVERED),
+   SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE),
SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5a5ce6da4792..4022073b0aee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3167,6 +3167,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
rate64 = tcp_compute_delivery_rate(tp);
if (rate64)
info->tcpi_delivery_rate = rate64;
+   info->tcpi_delivered = tp->delivered;
+   info->tcpi_delivered_ce = tp->delivered_ce;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3180,7 +3182,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
u32 rate;
 
stats = alloc_skb(7 * nla_total_size_64bit(sizeof(u64)) +
- 5 * nla_total_size(sizeof(u32)) +
+ 7 * nla_total_size(sizeof(u32)) +
  3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
if (!stats)
return NULL;
@@ -3211,9 +3213,12 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
nla_put_u8(stats, TCP_NLA_RECUR_RETRANS, 
inet_csk(sk)->icsk_retransmits);
nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, 
!!tp->rate_app_limited);
nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, tp->snd_ssthresh);
+   nla_put_u32(stats, TCP_NLA_DELIVERED, tp->delivered);
+   nla_put_u32(stats, TCP_NLA_DELIVERED_CE, tp->delivered_ce);
 
nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una);
nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state);
+
return stats;
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b3bff9c20606..0396fb919b5d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3499,12 +3499,16 @@ static void tcp_xmit_recovery(struct sock *sk, int 
rexmit)
 /* Returns the number of packets newly acked or sacked by the current ACK */
 static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)

[PATCH net-next 0/4] tracking TCP data delivery and ECN stats

2018-04-18 Thread Yuchung Cheng
This patch series improves tracking of the data delivery status:
  1. minor improvement on SYN data
  2. accounting bytes delivered with CE marks
  3. exporting the delivery stats to applications

so that users can get a better sense of TCP performance at the per-host,
per-connection, and even per-application-message level.

Yuchung Cheng (4):
  tcp: better delivery accounting for SYN-ACK and SYN-data
  tcp: new helper to calculate newly delivered
  tcp: track total bytes delivered with ECN CE marks
  tcp: export packets delivery info

 include/linux/tcp.h   |  1 +
 include/uapi/linux/snmp.h |  2 ++
 include/uapi/linux/tcp.h  |  5 +
 net/ipv4/proc.c   |  2 ++
 net/ipv4/tcp.c|  8 +++-
 net/ipv4/tcp_input.c  | 33 -
 6 files changed, 45 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog



[PATCH net-next 3/4] tcp: track total bytes delivered with ECN CE marks

2018-04-18 Thread Yuchung Cheng
Introduce a new delivered_ce stat in the tcp socket to estimate the
number of packets being marked with CE bits. The estimation is
done via ACKs with the ECE bit. Depending on the actual receiver
behavior, the estimation could have biases.

Since the TCP sender can't really see the CE bit in the data path,
it is technically counting packets marked delivered with
the "ECE / ECN-Echo" flag set.

With RFC3168 ECN, because the ECE bit is sticky, this count can
drastically overestimate the number of CE-marked data packets.

With DCTCP-style ECN this should be reasonably precise unless there
is loss in the ACK path, in which case it's not precise.

With the AccECN proposal this can be made still more precise, even in
the case of some degree of ACK loss.

However, this is the sender's best estimate of the CE information.
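
Reduced to its core, the sender-side estimate credits every newly
delivered packet to the CE count whenever the ACK carries ECE; a
simplified sketch with illustrative types, not the kernel code:

#include <stdbool.h>

struct deliv_est {
        unsigned int delivered;      /* all newly delivered packets */
        unsigned int delivered_ce;   /* portion credited as CE-marked */
};

/* On each ACK: credit all newly delivered packets to the CE count when the
 * ACK carries ECE. Sticky RFC3168 ECE makes this an overestimate;
 * DCTCP-style feedback keeps it close to exact.
 */
void on_ack(struct deliv_est *d, unsigned int newly_delivered, bool ack_has_ece)
{
        d->delivered += newly_delivered;
        if (ack_has_ece)
                d->delivered_ce += newly_delivered;
}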

Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 include/linux/tcp.h  | 1 +
 net/ipv4/tcp.c   | 1 +
 net/ipv4/tcp_input.c | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c54986f97..20585d5c4e1c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -281,6 +281,7 @@ struct tcp_sock {
 * receiver in Recovery. */
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
+   u32 delivered_ce;   /* Like the above but only ECE marked packets */
u32 lost;   /* Total data packets lost incl. rexmits */
u32 app_limited;/* limited until "delivered" reaches this val */
u64 first_tx_mstamp;  /* start of window send phase */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 438fbca96cd3..5a5ce6da4792 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2559,6 +2559,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_cnt = 0;
tp->window_clamp = 0;
+   tp->delivered_ce = 0;
tcp_set_ca_state(sk, TCP_CA_Open);
tp->is_sack_reneg = 0;
tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 01cce28f90ca..b3bff9c20606 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3503,6 +3503,8 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 
prior_delivered, int flag)
u32 delivered;
 
delivered = tp->delivered - prior_delivered;
+   if (flag & FLAG_ECE)
+   tp->delivered_ce += delivered;
return delivered;
 }
 
-- 
2.17.0.484.g0c8726318c-goog



[PATCH net-next 1/4] tcp: better delivery accounting for SYN-ACK and SYN-data

2018-04-18 Thread Yuchung Cheng
The tcp_sock:delivered field has inconsistent accounting for SYN and FIN.
1. it counts pure FIN
2. it counts pure SYN
3. it counts SYN-data twice
4. it does not count SYN-ACK

From a congestion control perspective it does not matter much, as C.C. only
cares about the difference, not the absolute value. But the next patch
will export this field to user-space, so it's better to report the absolute
value w/o these caveats.

This patch always counts SYN, SYN-ACK, or SYN-data delivery exactly once
in the "delivered" field.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f97d80..2499248d4a67 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5567,9 +5567,12 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, 
struct sk_buff *synack,
return true;
}
tp->syn_data_acked = tp->syn_data;
-   if (tp->syn_data_acked)
-   NET_INC_STATS(sock_net(sk),
-   LINUX_MIB_TCPFASTOPENACTIVE);
+   if (tp->syn_data_acked) {
+   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE);
+   /* SYN-data is counted as two separate packets in tcp_ack() */
+   if (tp->delivered > 1)
+   --tp->delivered;
+   }
 
tcp_fastopen_add_skb(sk, synack);
 
@@ -5901,6 +5904,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff 
*skb)
}
switch (sk->sk_state) {
case TCP_SYN_RECV:
+   tp->delivered++; /* SYN-ACK delivery isn't tracked in tcp_ack */
if (!tp->srtt_us)
tcp_synack_rtt_meas(sk, req);
 
-- 
2.17.0.484.g0c8726318c-goog



[PATCH net-next 2/4] tcp: new helper to calculate newly delivered

2018-04-18 Thread Yuchung Cheng
Add a new helper, tcp_newly_delivered(), to prepare for the ECN accounting change.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2499248d4a67..01cce28f90ca 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3496,6 +3496,16 @@ static void tcp_xmit_recovery(struct sock *sk, int 
rexmit)
tcp_xmit_retransmit_queue(sk);
 }
 
+/* Returns the number of packets newly acked or sacked by the current ACK */
+static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   u32 delivered;
+
+   delivered = tp->delivered - prior_delivered;
+   return delivered;
+}
+
 /* This routine deals with incoming acks, but not outgoing ones. */
 static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 {
@@ -3619,7 +3629,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
sk_dst_confirm(sk);
 
-   delivered = tp->delivered - delivered;  /* freshly ACKed or SACKed */
+   delivered = tcp_newly_delivered(sk, delivered, flag);
lost = tp->lost - lost; /* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
@@ -3629,9 +3639,11 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
 
 no_queue:
/* If data was DSACKed, see if we can undo a cwnd reduction. */
-   if (flag & FLAG_DSACKING_ACK)
+   if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, ,
  );
+   tcp_newly_delivered(sk, delivered, flag);
+   }
/* If this ack opens up a zero window, clear backoff.  It was
 * being used to time the probes, and is probably far higher than
 * it needs to be for normal retransmission.
@@ -3655,6 +3667,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
&sack_state);
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
  &rexmit);
+   tcp_newly_delivered(sk, delivered, flag);
tcp_xmit_recovery(sk, rexmit);
}
 
-- 
2.17.0.484.g0c8726318c-goog



Re: TCP one-by-one acking - RFC interpretation question

2018-04-12 Thread Yuchung Cheng
On Wed, Apr 11, 2018 at 5:06 AM, Michal Kubecek  wrote:
> On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
>> There is something else I don't understand, though. In the case of
>> acking previously sacked and never retransmitted segment,
>> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
>> using
>>
>> if (sack->first_sackt.v64) {
>> sack_rtt_us = skb_mstamp_us_delta(&now,
>> &sack->first_sackt);
>> ca_rtt_us = skb_mstamp_us_delta(&now,
>> &sack->last_sackt);
>> }
>>
>> (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
>> code correctly, both sack->first_sackt and sack->last_sackt contain
>> timestamps of initial segment transmission. This would mean we use the
>> time difference between the initial transmission and now, i.e. including
>> the RTO of the lost packet).
>>
>> IMHO we should take the actual round trip time instead, i.e. the
>> difference between the original transmission and the time the packet
>> sacked (first time). It seems we have been doing this before commit
>> 31231a8a8730 ("tcp: improve RTT from SACK for CC").
>
> Sorry for the noise, this was my misunderstanding, the first_sackt and
> last_sackt values are only taken from segments newly sacked by ack
> received right now, not those which were already sacked before.
>
> The actual problem and unrealistic RTT measurements come from another
> RFC violation I didn't mention before: the NAS doesn't follow RFC 2018
> section 4 rule for ordering of SACK blocks. Rather than sending SACK
> blocks three most recently received out-of-order blocks, it simply sends
> first three ordered by sequence numbers. In the earlier example (odd
> packets were received, even lost)
>
>ACK SAK SAK SAK
> +---+---+---+---+---+---+---+---+---+
> |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
> +---+---+---+---+---+---+---+---+---+
>   34273   35701   37129   38557   39985   41413   42841   44269   45697   
> 47125
>
> it responds to retransmitted segment 2 by
>
>   1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>   2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125
>
> This new SACK block 45697-47125 has not been retransmitted and as it
> wasn't sacked before, it is considered newly sacked. Therefore it gets
> processed and its deemed RTT (time since its original transmit time)
> "poisons" the RTT calculation, leading to RTO spiraling up.
>
> Thus if we want to work around the NAS behaviour, we would need to
> recognize such new SACK block as "not really new" and ignore it for
> first_sackt/last_sackt. I'm not sure if it's possible without
> misinterpreting actually delayed out of order packets. Of course, it is
> not clear if it's worth the effort to work around so severely broken TCP
> implementations (two obvious RFC violations, even if we don't count the
> one-by-one acking).
Right. Not much we (sender) can do if the receiver is not reporting
the delivery status correctly. This also negatively impacts TCP
congestion control (Cubic, Reno, BBR, CDG, etc.) because we've changed
it to increase/decrease cwnd based on both in-order and out-of-order
delivery.

We're close to publishing our internal packetdrill tests. Hopefully they
can be used to test these poor implementations.

>
> Michal Kubecek


Re: [PATCH net] tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets

2018-04-12 Thread Yuchung Cheng
On Wed, Apr 11, 2018 at 2:36 PM, Eric Dumazet <eduma...@google.com> wrote:
>
> syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]
>
> I believe this was caused by a TCP_MD5SIG being set on live
> flow.
>
> This is highly unexpected, since TCP option space is limited.
>
> For instance, presence of TCP MD5 option automatically disables
> TCP TimeStamp option at SYN/SYNACK time, which we can not do
> once flow has been established.
>
> Really, adding/deleting an MD5 key only makes sense on sockets
> in CLOSE or LISTEN state.
>
> [1]
> BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 
> net/ipv4/tcp_input.c:3720
> CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>  tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
>  tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
>  tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
>  tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
>  tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
>  sk_backlog_rcv include/net/sock.h:908 [inline]
>  __release_sock+0x2d6/0x680 net/core/sock.c:2271
>  release_sock+0x97/0x2a0 net/core/sock.c:2786
>  tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
>  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg net/socket.c:640 [inline]
>  SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
>  SyS_sendto+0x8a/0xb0 net/socket.c:1715
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x448fe9
> RSP: 002b:7fd472c64d38 EFLAGS: 0216 ORIG_RAX: 002c
> RAX: ffda RBX: 006e5a30 RCX: 00448fe9
> RDX: 029f RSI: 20a88f88 RDI: 0004
> RBP: 006e5a34 R08: 20e68000 R09: 0010
> R10: 27fd R11: 0216 R12: 
> R13: 7fff074899ef R14: 7fd472c659c0 R15: 0009
>
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>  slab_post_alloc_hook mm/slab.h:445 [inline]
>  slab_alloc_node mm/slub.c:2737 [inline]
>  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>  __kmalloc_reserve net/core/skbuff.c:138 [inline]
>  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>  alloc_skb include/linux/skbuff.h:984 [inline]
>  tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
>  __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
>  tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
>  tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
>  tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
>  sk_backlog_rcv include/net/sock.h:908 [inline]
>  __release_sock+0x2d6/0x680 net/core/sock.c:2271
>  release_sock+0x97/0x2a0 net/core/sock.c:2786
>  tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
>  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg net/socket.c:640 [inline]
>  SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
>  SyS_sendto+0x8a/0xb0 net/socket.c:1715
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>
> Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: syzbot <syzkal...@googlegroups.com>
> ---
Acked-by: Yuchung Cheng <ych...@google.com>

Thanks for the fix!
>  net/ipv4/tcp.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> bccc4c2700870b8c7ff592a6bd27acebd9bc6471..4fa3f812b9ff8954a9b6a018c648ff12ab995721
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2813,8 +2813,10 @@ static int do_tcp_setsockopt(struct sock *sk, int 
> level,
>  #ifdef CONFIG_TCP_MD5SIG
> case TCP_MD5SIG:
> case TCP_MD5SIG_EXT:
> -   /* Read the IP->Key mappings from userspace */
> -   err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
> +   if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> +   err = tp->af_specific->md5_parse(sk, optname, optval, 
> optlen);
> +   else
> +   err = -EINVAL;
> break;
>  #endif
> case TCP_USER_TIMEOUT:
> --
> 2.17.0.484.g0c8726318c-goog
>


Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-04-04 Thread Yuchung Cheng
On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell <ncardw...@google.com> wrote:
> On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ych...@google.com> wrote:
>> Agreed. That's a good point. And I would much prefer to rename that
>> to FLAG_ORIG_PROGRESS (w/ updated comment).
>
>> so I think we're in agreement to use existing patch w/ the new name
>> FLAG_ORIG_PROGRESS
>
> Yes, SGTM.
>
> I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
> the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
> branch?
huh? why not one patch ... this is getting close to patch-split paralysis.

>
> neal


Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-04-04 Thread Yuchung Cheng
On Wed, Apr 4, 2018 at 9:33 AM, Neal Cardwell <ncardw...@google.com> wrote:
>
> On Wed, Apr 4, 2018 at 6:35 AM Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> wrote:
>
> > On Wed, 28 Mar 2018, Yuchung Cheng wrote:
>
> > > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > > <ilpo.jarvi...@helsinki.fi> wrote:
> > > >
> > > > If SACK is not enabled and the first cumulative ACK after the RTO
> > > > retransmission covers more than the retransmitted skb, a spurious
> > > > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > > > The reason is that any non-retransmitted segment acknowledged will
> > > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > > > no indication that it would have been delivered for real (the
> > > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > > > case so the check for that bit won't help like it does with SACK).
> > > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > > > in tcp_process_loss.
> > > >
> > > > We need to use more strict condition for non-SACK case and check
> > > > that none of the cumulatively ACKed segments were retransmitted
> > > > to prove that progress is due to original transmissions. Only then
> > > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > > > non-SACK case.
> > > >
> > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> > > > ---
> > > >  net/ipv4/tcp_input.c | 9 +
> > > >  1 file changed, 9 insertions(+)
> > > >
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index 4a26c09..c60745c 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock
> *sk, u32 prior_fack,
> > > > pkts_acked = rexmit_acked +
> newdata_acked;
> > > >
> > > > tcp_remove_reno_sacks(sk, pkts_acked);
> > > > +
> > > > +   /* If any of the cumulatively ACKed segments
> was
> > > > +* retransmitted, non-SACK case cannot
> confirm that
> > > > +* progress was due to original transmission
> due to
> > > > +* lack of TCPCB_SACKED_ACKED bits even if
> some of
> > > > +* the packets may have been never
> retransmitted.
> > > > +*/
> > > > +   if (flag & FLAG_RETRANS_DATA_ACKED)
> > > > +   flag &= ~FLAG_ORIG_SACK_ACKED;
>
> FWIW I'd vote for this version.
>
> > Of course I could put the back there but I really like the new place more
> > (which was a result of your suggestion to place the code elsewhere).
> > IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we
> > weren't successful in proving (there in tcp_clean_rtx_queue) that progress
> > was due original transmission and thus I would not want falsely indicate
> > it with that flag. And there's the non-SACK related block anyway already
> > there so it keeps the non-SACK "pollution" off from the SACK code paths.
>
> I think that's a compelling argument. In particular, in terms of long-term
> maintenance it seems risky to allow an unsound/buggy FLAG_ORIG_SACK_ACKED
> bit be returned by tcp_clean_rtx_queue(). If we return an
> incorrect/imcomplete FLAG_ORIG_SACK_ACKED bit then I worry that one day we
> will forget that for non-SACK flows that bit is incorrect/imcomplete, and
> we will add code using that bit but forgetting to check (tcp_is_sack(tp) ||
> !FLAG_RETRANS_DATA_ACKED).
Agreed. That's a good point. And I would much prefer to rename that
to FLAG_ORIG_PROGRESS (w/ updated comment).

so I think we're in agreement to use existing patch w/ the new name
FLAG_ORIG_PROGRESS

>
> > (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to
> > FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition
> > we're after regardless of SACK and less ambiguous in non-SACK case).
>
> I'm neutral on this. Not sure if the extra clarity is worth the code churn.
>
> cheers,
> neal


Re: [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-28 Thread Yuchung Cheng
On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
 wrote:
> When a cumulative ACK lands to high_seq at the end of loss
> recovery and SACK is not enabled, the sender needs to avoid
> false fast retransmits (RFC6582). The avoidance mechanism is
> implemented by remaining in the loss recovery CA state until
> one additional cumulative ACK arrives. During the operation of
> this avoidance mechanism, there is internal transient in the
> use of state variables which will always trigger a bogus undo.
Do we have to make undo in the non-SACK case perfect? Can we consider a
much simpler but imperfect fix such as:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8d480542aa07..95225d9de0af 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2356,6 +2356,7 @@ static bool tcp_try_undo_recovery(struct sock *sk)
 * fast retransmits (RFC2582). SACK TCP is safe. */
if (!tcp_any_retrans_done(sk))
tp->retrans_stamp = 0;
+   tp->undo_marker = 0;
return true;
}



>
> When we enter to this transient state in tcp_try_undo_recovery,
> tcp_any_retrans_done is often (always?) false resulting in
> clearing retrans_stamp. On the next cumulative ACK,
> tcp_try_undo_recovery again executes because CA state still
> remains in the same recovery state and tcp_may_undo will always
> return true because tcp_packet_delayed has this condition:
> return !tp->retrans_stamp || ...
>
> Check if the false fast retransmit transient avoidance is in
> progress in tcp_packet_delayed to avoid bogus undos. Since snd_una
> has advanced already on this ACK but CA state still remains
> unchanged (CA state is updated slightly later than undo is
> checked), prior_snd_una needs to be passed to tcp_packet_delayed
> (instead of tp->snd_una). Passing prior_snd_una around to
> the tcp_packet_delayed makes this change look more involved than
> it really is.
>
> The additional checks done in this change only affect non-SACK
> case, the SACK case remains the same.
>
> Signed-off-by: Ilpo Järvinen 
> ---
>  net/ipv4/tcp_input.c | 42 ++
>  1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 72ecfbb..270aa48 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct 
> tcp_sock *tp,
>  /* Nothing was retransmitted or returned timestamp is less
>   * than timestamp of the first retransmission.
>   */
> -static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
> +static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
> + const u32 prior_snd_una)
>  {
> -   return !tp->retrans_stamp ||
> -  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
> +   if (!tp->retrans_stamp) {
> +   /* Sender will be in a transient state with cleared
> +* retrans_stamp during false fast retransmit prevention
> +* mechanism
> +*/
> +   return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
> +   }
> +   return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
>  }
>
>  /* Undo procedures. */
> @@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, 
> bool unmark_loss)
> tp->rack.advanced = 1; /* Force RACK to re-exam losses */
>  }
>
> -static inline bool tcp_may_undo(const struct tcp_sock *tp)
> +static inline bool tcp_may_undo(const struct tcp_sock *tp,
> +   const u32 prior_snd_una)
>  {
> -   return tp->undo_marker && (!tp->undo_retrans || 
> tcp_packet_delayed(tp));
> +   return tp->undo_marker &&
> +  (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
>  }
>
>  /* People celebrate: "We love our President!" */
> -static bool tcp_try_undo_recovery(struct sock *sk)
> +static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
>  {
> struct tcp_sock *tp = tcp_sk(sk);
>
> -   if (tcp_may_undo(tp)) {
> +   if (tcp_may_undo(tp, prior_snd_una)) {
> int mib_idx;
>
> /* Happy end! We did not retransmit anything
> @@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
>  }
>
>  /* Undo during loss recovery after partial ACK or using F-RTO. */
> -static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
> +static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
> + bool frto_undo)
>  {
> struct tcp_sock *tp = tcp_sk(sk);
>
> -   if (frto_undo || tcp_may_undo(tp)) {
> +   if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
> tcp_undo_cwnd_reduction(sk, true);
>
> DBGUNDO(sk, "partial loss");
> @@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct 

Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-28 Thread Yuchung Cheng
On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
 wrote:
>
> If SACK is not enabled and the first cumulative ACK after the RTO
> retransmission covers more than the retransmitted skb, a spurious
> FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> The reason is that any non-retransmitted segment acknowledged will
> set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> no indication that it would have been delivered for real (the
> scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> case so the check for that bit won't help like it does with SACK).
> Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> in tcp_process_loss.
>
> We need to use more strict condition for non-SACK case and check
> that none of the cumulatively ACKed segments were retransmitted
> to prove that progress is due to original transmissions. Only then
> keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> non-SACK case.
>
> Signed-off-by: Ilpo Järvinen 
> ---
>  net/ipv4/tcp_input.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 4a26c09..c60745c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> prior_fack,
> pkts_acked = rexmit_acked + newdata_acked;
>
> tcp_remove_reno_sacks(sk, pkts_acked);
> +
> +   /* If any of the cumulatively ACKed segments was
> +* retransmitted, non-SACK case cannot confirm that
> +* progress was due to original transmission due to
> +* lack of TCPCB_SACKED_ACKED bits even if some of
> +* the packets may have been never retransmitted.
> +*/
> +   if (flag & FLAG_RETRANS_DATA_ACKED)
> +   flag &= ~FLAG_ORIG_SACK_ACKED;

How about keeping your excellent comment but moving the fix to the F-RTO
code directly so it's clearer? This way the flag's meaning stays unambiguous:
it indicates that some never-retransmitted data was acked/sacked.

// pseudo code for illustration

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8d480542aa07..f7f3357de618 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2629,8 +2629,15 @@ static void tcp_process_loss(struct sock *sk,
int flag, bool is_dupack,
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
/* Step 3.b. A timeout is spurious if not all data are
 * lost, i.e., never-retransmitted data are (s)acked.
+*
+* If any of the cumulatively ACKed segments was
+* retransmitted, non-SACK case cannot confirm that
+* progress was due to original transmission due to
+* lack of TCPCB_SACKED_ACKED bits even if some of
+* the packets may have been never retransmitted.
 */
if ((flag & FLAG_ORIG_SACK_ACKED) &&
+   (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
tcp_try_undo_loss(sk, true))
return;





> } else {
> int delta;
>
> --
> 2.7.4
>


Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-28 Thread Yuchung Cheng
On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ych...@google.com> wrote:
>
> On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> >> <ilpo.jarvi...@helsinki.fi> wrote:
> >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >> >
> >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> >> <ilpo.jarvi...@helsinki.fi> wrote:
> >> >> >
> >> >> > A miscalculation for the number of acknowledged packets occurs during
> >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> >> > RTO as it may include segments that were not lost and therefore did
> >> >> > not need retransmissions in the slow start following the RTO. Then
> >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> >> > triggering a burst.
> >> >> >
> >> >> > Instead, we want to pass only the number of retransmitted segments
> >> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> >> > data segments too if the cumulative ACK covers that far).
> >> >> >
> >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> >> >> > ---
> >> >>
> >> >> My understanding is there are two problems
> >> >>
> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
> >> >> precise cumulatively acked count, not newly acked count?
> >> >
> >> > While I'm not entirely sure if you intented to say that my fix is broken
> >> > or not, I thought this very difference alot while making the fix and I
> >> > believe that this fix is needed because of the discontinuity at RTO
> >> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> >> >
> >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> >> > tp->delivered, using plain cumulative acked causes congestion control
> >> > breakage later as call to tcp_cong_control will directly use the
> >> > difference in tp->delivered.
> >> >
> >> > This boils down the exact definition of tp->delivered (the one given in
> >> > the header is not detailed enough). I guess you might have better idea
> >> > what it exactly is since one of you has added it? There are subtle things
> >> > in the defination that can make it entirely unsuitable for cc decisions.
> >> > Should those segments that we (possibly) already counted into
> >> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> >> > for _second time_ or not? This fix avoids such double counting (the
> >> Where is the double counting, assuming normal DUPACK behavior?
> >>
> >> In the non-sack case:
> >>
> >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> >> incrementing tp->delivered in tcp_add_reno_sack()
> >
> > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > I've tried to point out quite many times already)...
> >
> >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> >> #dupacks) in tcp_remove_reno_sacks()
> >
> > ...and this won't happen correctly anymore after RTO (since non-SACK
> > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > cumulatively acked - 0 to tp->delivered on those ACKs.
> >
> >> therefore tp->delivered is tracking the # of packets delivered
> >> (sacked, acked, DUPACK'd) with the most information it could have
> >> inferred.
> >
> > Since you didn't answer any of my questions about tp->delivered directly,
> > let me rephrase them to this example (non-SACK, of course):
> >
>

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-28 Thread Yuchung Cheng
On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
<ilpo.jarvi...@helsinki.fi> wrote:
> On Tue, 27 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
>> <ilpo.jarvi...@helsinki.fi> wrote:
>> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> >> <ilpo.jarvi...@helsinki.fi> wrote:
>> >> >
>> >> > A miscalculation for the number of acknowledged packets occurs during
>> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> >> > any non-retransmitted skbs. The reason is that pkts_acked value
>> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> >> > RTO as it may include segments that were not lost and therefore did
>> >> > not need retransmissions in the slow start following the RTO. Then
>> >> > tcp_slow_start will add the excess into cwnd bloating it and
>> >> > triggering a burst.
>> >> >
>> >> > Instead, we want to pass only the number of retransmitted segments
>> >> > that were covered by the cumulative ACK (and potentially newly sent
>> >> > data segments too if the cumulative ACK covers that far).
>> >> >
>> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
>> >> > ---
>> >>
>> >> My understanding is there are two problems
>> >>
>> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
>> >> precise cumulatively acked count, not newly acked count?
>> >
>> > While I'm not entirely sure if you intented to say that my fix is broken
>> > or not, I thought this very difference alot while making the fix and I
>> > believe that this fix is needed because of the discontinuity at RTO
>> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
>> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
>> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
>> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>> >
>> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
>> > tp->delivered, using plain cumulative acked causes congestion control
>> > breakage later as call to tcp_cong_control will directly use the
>> > difference in tp->delivered.
>> >
>> > This boils down the exact definition of tp->delivered (the one given in
>> > the header is not detailed enough). I guess you might have better idea
>> > what it exactly is since one of you has added it? There are subtle things
>> > in the defination that can make it entirely unsuitable for cc decisions.
>> > Should those segments that we (possibly) already counted into
>> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
>> > for _second time_ or not? This fix avoids such double counting (the
>> Where is the double counting, assuming normal DUPACK behavior?
>>
>> In the non-sack case:
>>
>> 1. upon receiving a DUPACK, we assume one packet has been delivered by
>> incrementing tp->delivered in tcp_add_reno_sack()
>
> 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> I've tried to point out quite many times already)...
>
>> 2. upon receiving a partial ACK or an ACK that acks recovery point
>> (high_seq), tp->delivered is incremented by (cumulatively acked -
>> #dupacks) in tcp_remove_reno_sacks()
>
> ...and this won't happen correctly anymore after RTO (since non-SACK
> won't keep #dupacks due to the discontinuity). Thus we end up adding
> cumulatively acked - 0 to tp->delivered on those ACKs.
>
>> therefore tp->delivered is tracking the # of packets delivered
>> (sacked, acked, DUPACK'd) with the most information it could have
>> inferred.
>
> Since you didn't answer any of my questions about tp->delivered directly,
> let me rephrase them to this example (non-SACK, of course):
>
> 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> three segments. How much should tp->delivered be incremented? 2 or 3?
>
> ...I think 2 is the right answer.
>
>> From congestion control's perspective, it cares about the delivery
>> information (e.g. how much), not the sequences (what or how).
>
> I guess you must have missed my poi

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-27 Thread Yuchung Cheng
On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
<ilpo.jarvi...@helsinki.fi> wrote:
> On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> <ilpo.jarvi...@helsinki.fi> wrote:
>> >
>> > A miscalculation for the number of acknowledged packets occurs during
>> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> > any non-retransmitted skbs. The reason is that pkts_acked value
>> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> > RTO as it may include segments that were not lost and therefore did
>> > not need retransmissions in the slow start following the RTO. Then
>> > tcp_slow_start will add the excess into cwnd bloating it and
>> > triggering a burst.
>> >
>> > Instead, we want to pass only the number of retransmitted segments
>> > that were covered by the cumulative ACK (and potentially newly sent
>> > data segments too if the cumulative ACK covers that far).
>> >
>> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
>> > ---
>> >  net/ipv4/tcp_input.c | 16 +++-
>> >  1 file changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> > index 9a1b3c1..4a26c09 100644
>> > --- a/net/ipv4/tcp_input.c
>> > +++ b/net/ipv4/tcp_input.c
>> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
>> > prior_fack,
>> > long seq_rtt_us = -1L;
>> > long ca_rtt_us = -1L;
>> > u32 pkts_acked = 0;
>> > +   u32 rexmit_acked = 0;
>> > +   u32 newdata_acked = 0;
>> > u32 last_in_flight = 0;
>> > bool rtt_update;
>> > int flag = 0;
>> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
>> > prior_fack,
>> > }
>> >
>> > if (unlikely(sacked & TCPCB_RETRANS)) {
>> > -   if (sacked & TCPCB_SACKED_RETRANS)
>> > +   if (sacked & TCPCB_SACKED_RETRANS) {
>> > tp->retrans_out -= acked_pcount;
>> > +   rexmit_acked += acked_pcount;
>> > +   }
>> > flag |= FLAG_RETRANS_DATA_ACKED;
>> > } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>> > last_ackt = skb->skb_mstamp;
>> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
>> > prior_fack,
>> > reord = start_seq;
>> > if (!after(scb->end_seq, tp->high_seq))
>> > flag |= FLAG_ORIG_SACK_ACKED;
>> > +   else
>> > +   newdata_acked += acked_pcount;
>> > }
>> >
>> > if (sacked & TCPCB_SACKED_ACKED) {
>> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
>> > prior_fack,
>> > }
>> >
>> > if (tcp_is_reno(tp)) {
>> > +   /* Due to discontinuity on RTO in the artificial
>> > +* sacked_out calculations, TCP must restrict
>> > +* pkts_acked without SACK to rexmits and new data
>> > +* segments
>> > +*/
>> > +   if (icsk->icsk_ca_state == TCP_CA_Loss)
>> > +   pkts_acked = rexmit_acked + newdata_acked;
>> > +
>> My understanding is there are two problems
>>
>> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
>> precise cumulatively acked count, not newly acked count?
>
> While I'm not entirely sure if you intented to say that my fix is broken
> or not, I thought this very difference alot while making the fix and I
> believe that this fix is needed because of the discontinuity at RTO
> (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>
> In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> tp->

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-26 Thread Yuchung Cheng
On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
 wrote:
>
> A miscalculation for the number of acknowledged packets occurs during
> RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> any non-retransmitted skbs. The reason is that pkts_acked value
> calculated in tcp_clean_rtx_queue is not correct for slow start after
> RTO as it may include segments that were not lost and therefore did
> not need retransmissions in the slow start following the RTO. Then
> tcp_slow_start will add the excess into cwnd bloating it and
> triggering a burst.
>
> Instead, we want to pass only the number of retransmitted segments
> that were covered by the cumulative ACK (and potentially newly sent
> data segments too if the cumulative ACK covers that far).
>
> Signed-off-by: Ilpo Järvinen 
> ---
>  net/ipv4/tcp_input.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..4a26c09 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> prior_fack,
> long seq_rtt_us = -1L;
> long ca_rtt_us = -1L;
> u32 pkts_acked = 0;
> +   u32 rexmit_acked = 0;
> +   u32 newdata_acked = 0;
> u32 last_in_flight = 0;
> bool rtt_update;
> int flag = 0;
> @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> prior_fack,
> }
>
> if (unlikely(sacked & TCPCB_RETRANS)) {
> -   if (sacked & TCPCB_SACKED_RETRANS)
> +   if (sacked & TCPCB_SACKED_RETRANS) {
> tp->retrans_out -= acked_pcount;
> +   rexmit_acked += acked_pcount;
> +   }
> flag |= FLAG_RETRANS_DATA_ACKED;
> } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> prior_fack,
> reord = start_seq;
> if (!after(scb->end_seq, tp->high_seq))
> flag |= FLAG_ORIG_SACK_ACKED;
> +   else
> +   newdata_acked += acked_pcount;
> }
>
> if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> prior_fack,
> }
>
> if (tcp_is_reno(tp)) {
> +   /* Due to discontinuity on RTO in the artificial
> +* sacked_out calculations, TCP must restrict
> +* pkts_acked without SACK to rexmits and new data
> +* segments
> +*/
> +   if (icsk->icsk_ca_state == TCP_CA_Loss)
> +   pkts_acked = rexmit_acked + newdata_acked;
> +
My understanding is there are two problems

1) your fix: the reordering logic in tcp_remove_reno_sacks requires
precise cumulatively acked count, not newly acked count?


2) current code: pkts_acked can substantially over-estimate the newly
delivered pkts in both SACK and non-SACK cases. For example, let's say
99/100 packets are already sacked, and the next ACK acks 100 pkts.
pkts_acked == 100 but really only one packet is delivered. It's wrong
to inform congestion control that 100 packets have just been delivered.
AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
the newly delivered packets.

A better fix for both SACK and non-SACK seems to be moving
ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" count is
properly calibrated. This is what BBR is currently doing to avoid these pitfalls.
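
To make the over-estimate concrete, a tiny stand-alone sketch (not kernel
code; the variable names are illustrative) of the 99-out-of-100 scenario
above, comparing what a pkts_acked-style callback sees against a
delivered-delta style count:

#include <stdio.h>

int main(void)
{
	unsigned int cumulatively_acked = 100;	/* packets covered by this ACK */
	unsigned int already_sacked     = 99;	/* delivered (and counted) on
						 * earlier DUPACKs */

	unsigned int pkts_acked      = cumulatively_acked;
	unsigned int newly_delivered = cumulatively_acked - already_sacked;

	printf("pkts_acked handed to the CC module: %u\n", pkts_acked);	/* 100 */
	printf("packets actually newly delivered:   %u\n", newly_delivered); /* 1 */
	return 0;
}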


> tcp_remove_reno_sacks(sk, pkts_acked);
> } else {
> int delta;
> --
> 2.7.4
>


Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-07 Thread Yuchung Cheng
On Wed, Mar 7, 2018 at 12:19 PM, Neal Cardwell  wrote:
>
> On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen  
> wrote:
> > A bogus undo may/will trigger when the loss recovery state is
> > kept until snd_una is above high_seq. If tcp_any_retrans_done
> > is zero, retrans_stamp is cleared in this transient state. On
> > the next ACK, tcp_try_undo_recovery again executes and
> > tcp_may_undo will always return true because tcp_packet_delayed
> > has this condition:
> > return !tp->retrans_stamp || ...
> >
> > Check for the false fast retransmit transient condition in
> > tcp_packet_delayed to avoid bogus undos. Since snd_una may have
> > advanced on this ACK but CA state still remains unchanged,
> > prior_snd_una needs to be passed instead of tp->snd_una.
>
> This one also seems like a case where it would be nice to have a
> specific packet-by-packet example, or trace, or packetdrill scenario.
> Something that we might be able to translate into a test, or at least
> to document the issue more explicitly.
I am hesitant to add further logic to make undo "perfect" in non-SACK
cases b/c undo is very complicated and SACK is extremely
well-supported today. So a trace demonstrating how severe this issue is
would be appreciated.

>
> Thanks!
> neal


Re: [PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-07 Thread Yuchung Cheng
On Wed, Mar 7, 2018 at 11:24 AM, Neal Cardwell  wrote:
>
> On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen  
> wrote:
> >
> > In a non-SACK case, any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case). This causes bogus undos in ordinary RTO recoveries where
> > segments are lost here and there, with a few delivered segments in
> > between losses. A cumulative ACKs will cover retransmitted ones at
> > the bottom and the non-retransmitted ones following that causing
> > FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and results
> > in a spurious FRTO undo.
> >
> > We need to make the check more strict for non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted,
> > which would be the case for the last step of FRTO algorithm as we
> > sent out only new segments previously. Only then, allow FRTO undo
> > to proceed in non-SACK case.
>
> Hi Ilpo - Do you have a packet trace or (even better) packetdrill
> script illustrating this issue? It would be nice to have a test case
> or at least concrete example of this.
A packetdrill script or even a contrived example would be good ... also, why
not just avoid setting FLAG_ORIG_SACK_ACKED in the non-SACK case? That seems
a much cleaner fix.

>
> Thanks!
> neal


[PATCH 2/2 net] tcp: revert F-RTO extension to detect more spurious timeouts

2018-02-27 Thread Yuchung Cheng
This reverts commit 89fe18e44f7ee5ab1c90d0dff5835acee7751427.

While the patch could detect more spurious timeouts, it could cause
poor TCP performance on broken middle-boxes that modify TCP packets
(e.g. receive window, SACK options). Since the performance gain is
much smaller than the potential loss, the best solution is
to fully revert the change.

Fixes: 89fe18e44f7e ("tcp: extend F-RTO to catch more spurious timeouts")
Reported-by: Teodor Milkov <t...@del.bg>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
---
 net/ipv4/tcp_input.c | 30 --
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cd8ea972dc65..8d480542aa07 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1909,6 +1909,7 @@ void tcp_enter_loss(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
struct sk_buff *skb;
+   bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
bool mark_lost;
 
@@ -1967,15 +1968,12 @@ void tcp_enter_loss(struct sock *sk)
tp->high_seq = tp->snd_nxt;
tcp_ecn_queue_cwr(tp);
 
-   /* F-RTO RFC5682 sec 3.1 step 1 mandates to disable F-RTO
-* if a previous recovery is underway, otherwise it may incorrectly
-* call a timeout spurious if some previously retransmitted packets
-* are s/acked (sec 3.2). We do not apply that retriction since
-* retransmitted skbs are permanently tagged with TCPCB_EVER_RETRANS
-* so FLAG_ORIG_SACK_ACKED is always correct. But we do disable F-RTO
-* on PTMU discovery to avoid sending new data.
+   /* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous
+* loss recovery is underway except recurring timeout(s) on
+* the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing
 */
tp->frto = net->ipv4.sysctl_tcp_frto &&
+  (new_recovery || icsk->icsk_retransmits) &&
   !inet_csk(sk)->icsk_mtup.probe_size;
 }
 
@@ -2628,18 +2626,14 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
tcp_try_undo_loss(sk, false))
return;
 
-   /* The ACK (s)acks some never-retransmitted data meaning not all
-* the data packets before the timeout were lost. Therefore we
-* undo the congestion window and state. This is essentially
-* the operation in F-RTO (RFC5682 section 3.1 step 3.b). Since
-* a retransmitted skb is permantly marked, we can apply such an
-* operation even if F-RTO was not used.
-*/
-   if ((flag & FLAG_ORIG_SACK_ACKED) &&
-   tcp_try_undo_loss(sk, tp->undo_marker))
-   return;
-
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
+   /* Step 3.b. A timeout is spurious if not all data are
+* lost, i.e., never-retransmitted data are (s)acked.
+*/
+   if ((flag & FLAG_ORIG_SACK_ACKED) &&
+   tcp_try_undo_loss(sk, true))
+   return;
+
if (after(tp->snd_nxt, tp->high_seq)) {
if (flag & FLAG_DATA_SACKED || is_dupack)
tp->frto = 0; /* Step 3.a. loss was real */
-- 
2.16.1.291.g4437f3f132-goog



[PATCH 0/2 net] revert a F-RTO extension due to broken middle-boxes

2018-02-27 Thread Yuchung Cheng
This patch series reverts a (non-standard) TCP F-RTO extension that aimed
to detect more spurious timeouts. Unfortunately it could result in poor
performance due to broken middle-boxes that modify TCP packets. E.g.
https://www.spinics.net/lists/netdev/msg484154.html
We believe the best and simplest solution is to just revert the change.

Yuchung Cheng (2):
  tcp: revert F-RTO middle-box workaround
  tcp: revert F-RTO extension to detect more spurious timeouts

 net/ipv4/tcp_input.c | 23 +++
 1 file changed, 7 insertions(+), 16 deletions(-)

-- 
2.16.1.291.g4437f3f132-goog



[PATCH 1/2 net] tcp: revert F-RTO middle-box workaround

2018-02-27 Thread Yuchung Cheng
This reverts commit cc663f4d4c97b7297fb45135ab23cfd508b35a77. While it works
around some broken middle-boxes that modify receive window fields, it does not
address middle-boxes that strip off SACK options. The best solution is
to fully revert this patch and the root F-RTO enhancement.

Fixes: cc663f4d4c97 ("tcp: restrict F-RTO to work-around broken middle-boxes")
Reported-by: Teodor Milkov <t...@del.bg>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
---
 net/ipv4/tcp_input.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 575d3c1fb6e8..cd8ea972dc65 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1909,7 +1909,6 @@ void tcp_enter_loss(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
struct sk_buff *skb;
-   bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
bool mark_lost;
 
@@ -1968,17 +1967,15 @@ void tcp_enter_loss(struct sock *sk)
tp->high_seq = tp->snd_nxt;
tcp_ecn_queue_cwr(tp);
 
-   /* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous
-* loss recovery is underway except recurring timeout(s) on
-* the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing
-*
-* In theory F-RTO can be used repeatedly during loss recovery.
-* In practice this interacts badly with broken middle-boxes that
-* falsely raise the receive window, which results in repeated
-* timeouts and stop-and-go behavior.
+   /* F-RTO RFC5682 sec 3.1 step 1 mandates to disable F-RTO
+* if a previous recovery is underway, otherwise it may incorrectly
+* call a timeout spurious if some previously retransmitted packets
+* are s/acked (sec 3.2). We do not apply that retriction since
+* retransmitted skbs are permanently tagged with TCPCB_EVER_RETRANS
+* so FLAG_ORIG_SACK_ACKED is always correct. But we do disable F-RTO
+* on PTMU discovery to avoid sending new data.
 */
tp->frto = net->ipv4.sysctl_tcp_frto &&
-  (new_recovery || icsk->icsk_retransmits) &&
   !inet_csk(sk)->icsk_mtup.probe_size;
 }
 
-- 
2.16.1.291.g4437f3f132-goog



Re: A TLP implementation question

2018-02-13 Thread Yuchung Cheng
On Tue, Feb 13, 2018 at 4:27 PM, hiren panchasara
 wrote:
>
> Looking at current net-next to understand an aspect of TLP (tail loss
> probe) implementation.
>
> https://tools.ietf.org/html/draft-ietf-tcpm-rack-02 is the source of
> truth now for TLP and 6.2.1.  Phase 1: Scheduling a loss probe
> Step 1: Check conditions for scheduling a PTO. has following as one of
> the conditions:
> (d) The most recently transmitted data was not itself a TLP probe
> (i.e. a sender MUST NOT send consecutive TLP probes)
This is done by
1) calling tcp_write_xmit(push_one==2) in tcp_send_loss_probe()
2) not calling tcp_schedule_loss_probe() if push_one == 2 in tcp_write_xmit()
3) aborting if one TLP probe is in flight, by checking tlp_high_seq in
tcp_send_loss_probe()

Consequently the sender will never schedule a PTO upon sending a probe
(new or rtx), which avoids consecutive probes.
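
For illustration only, a tiny stand-alone user-space sketch (not kernel
code; all names are toy stand-ins for the real tcp_sock/icsk state) of how
those three checks interact to keep probes from going out back-to-back:

#include <stdbool.h>
#include <stdio.h>

struct toy_sock {
	unsigned int tlp_high_seq;	/* nonzero => a TLP probe is in flight */
	unsigned int snd_nxt;
	bool pto_armed;
};

/* toy model of tcp_write_xmit(): push_one == 2 means "called from the loss
 * probe path", so it must not arm another PTO (check 2)
 */
static void toy_write_xmit(struct toy_sock *sk, int push_one)
{
	if (push_one != 2)
		sk->pto_armed = true;	/* toy tcp_schedule_loss_probe() */
}

/* toy model of tcp_send_loss_probe() firing on a PTO */
static void toy_send_loss_probe(struct toy_sock *sk)
{
	sk->pto_armed = false;
	if (sk->tlp_high_seq)		/* check 3: at most one probe out */
		return;
	toy_write_xmit(sk, 2);		/* check 1: probe sent with push_one == 2 */
	sk->tlp_high_seq = sk->snd_nxt;	/* remember the outstanding probe */
}

int main(void)
{
	struct toy_sock sk = { .snd_nxt = 1000 };

	toy_send_loss_probe(&sk);	/* 1st PTO: probe goes out, no new PTO */
	printf("probe out=%d, PTO armed=%d\n", sk.tlp_high_seq != 0, sk.pto_armed);

	toy_send_loss_probe(&sk);	/* hypothetical 2nd PTO: no 2nd probe */
	printf("probe out=%d, PTO armed=%d\n", sk.tlp_high_seq != 0, sk.pto_armed);
	return 0;
}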

hth.

>
> I would appreciate if someone can help me trace how current code is
> trying to enforce this requirement. How does it check/track that the
> last (re)transmitted packet was a tlp probe.
>
> Thanks in advance,
> Hiren


Re: [PATCH bpf-next v8 08/12] bpf: Add support for reading sk_state and more

2018-01-24 Thread Yuchung Cheng
On Tue, Jan 23, 2018 at 11:57 PM, Lawrence Brakmo  wrote:
> Add support for reading many more tcp_sock fields
>
>   state,same as sk->sk_state
>   rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
>   snd_ssthresh
>   rcv_nxt
>   snd_nxt
>   snd_una
>   mss_cache
>   ecn_flags
>   rate_delivered
>   rate_interval_us
>   packets_out
>   retrans_out
Might as well expose ca_state, sacked_out and lost_out too, to estimate the
CA state and the number of packets in flight?
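
For context: with those fields a sock_ops program could estimate the
standard in-flight count the same way the kernel's tcp_packets_in_flight()
does. A purely illustrative user-space sketch (not part of this patch):

#include <stdint.h>
#include <stdio.h>

/* mirrors tcp_packets_in_flight(): left_out = sacked_out + lost_out */
static uint32_t est_packets_in_flight(uint32_t packets_out, uint32_t sacked_out,
				      uint32_t lost_out, uint32_t retrans_out)
{
	return packets_out - (sacked_out + lost_out) + retrans_out;
}

int main(void)
{
	/* made-up numbers: 10 outstanding, 2 SACKed, 1 marked lost, 1 rexmitted */
	printf("in flight: %u\n",
	       (unsigned)est_packets_in_flight(10, 2, 1, 1));
	return 0;
}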

>   total_retrans
>   segs_in
>   data_segs_in
>   segs_out
>   data_segs_out
>   sk_txhash
>   bytes_received (__u64)
>   bytes_acked(__u64)
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  include/uapi/linux/bpf.h |  20 +++
>  net/core/filter.c| 135 
> +++
>  2 files changed, 144 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a8c40a..6998032 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -979,6 +979,26 @@ struct bpf_sock_ops {
> __u32 snd_cwnd;
> __u32 srtt_us;  /* Averaged RTT << 3 in usecs */
> __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
> +   __u32 state;
> +   __u32 rtt_min;
> +   __u32 snd_ssthresh;
> +   __u32 rcv_nxt;
> +   __u32 snd_nxt;
> +   __u32 snd_una;
> +   __u32 mss_cache;
> +   __u32 ecn_flags;
> +   __u32 rate_delivered;
> +   __u32 rate_interval_us;
> +   __u32 packets_out;
> +   __u32 retrans_out;
> +   __u32 total_retrans;
> +   __u32 segs_in;
> +   __u32 data_segs_in;
> +   __u32 segs_out;
> +   __u32 data_segs_out;
> +   __u32 sk_txhash;
> +   __u64 bytes_received;
> +   __u64 bytes_acked;
>  };
>
>  /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6936d19..ffe9b60 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
>  }
>  EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> -static bool __is_valid_sock_ops_access(int off, int size)
> +static bool sock_ops_is_valid_access(int off, int size,
> +enum bpf_access_type type,
> +struct bpf_insn_access_aux *info)
>  {
> +   const int size_default = sizeof(__u32);
> +
> if (off < 0 || off >= sizeof(struct bpf_sock_ops))
> return false;
> +
> /* The verifier guarantees that size > 0. */
> if (off % size != 0)
> return false;
> -   if (size != sizeof(__u32))
> -   return false;
> -
> -   return true;
> -}
>
> -static bool sock_ops_is_valid_access(int off, int size,
> -enum bpf_access_type type,
> -struct bpf_insn_access_aux *info)
> -{
> if (type == BPF_WRITE) {
> switch (off) {
> case offsetof(struct bpf_sock_ops, reply):
> +   if (size != size_default)
> +   return false;
> break;
> default:
> return false;
> }
> +   } else {
> +   switch (off) {
> +   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
> +   bytes_acked):
> +   if (size != sizeof(__u64))
> +   return false;
> +   break;
> +   default:
> +   if (size != size_default)
> +   return false;
> +   break;
> +   }
> }
>
> -   return __is_valid_sock_ops_access(off, size);
> +   return true;
>  }
>
>  static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
> @@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
> bpf_access_type type,
>is_fullsock));
> break;
>
> +   case offsetof(struct bpf_sock_ops, state):
> +   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 
> 1);
> +
> +   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +   struct bpf_sock_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_sock_ops_kern, sk));
> +   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, 
> skc_state));
> +   break;
> +
> +   case offsetof(struct bpf_sock_ops, rtt_min):
> +   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
> +sizeof(struct minmax));
> +   

Re: [PATCH bpf-next v8 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-24 Thread Yuchung Cheng
On Tue, Jan 23, 2018 at 11:58 PM, Lawrence Brakmo  wrote:
> Adds support for calling sock_ops BPF program when there is a
> retransmission. Two arguments are used; one for the sequence number and
> other for the number of segments retransmitted. Does not include syn-ack
> retransmissions.
>
> New op: BPF_SOCK_OPS_RETRANS_CB.
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  include/uapi/linux/bpf.h | 4 
>  include/uapi/linux/tcp.h | 3 ++-
>  net/ipv4/tcp_output.c| 3 +++
>  3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6998032..eb26cdb 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1039,6 +1039,10 @@ enum {
>  * Arg2: value of icsk_rto
>  * Arg3: whether RTO has expired
>  */
> +   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
> +* Arg1: sequence number of 1st byte
> +* Arg2: # segments
> +*/
>  };
>
>  #define TCP_BPF_IW 1001/* Set TCP initial congestion window 
> */
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 129032ca..ec03a2b 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
>
>  /* Definitions for bpf_sock_ops_cb_flags */
>  #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
> -#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all 
> currently
> +#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
> +#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all 
> currently
>  * supported cb flags
>  */
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d12f7f7..f7d34f01 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2908,6 +2908,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct 
> sk_buff *skb, int segs)
> if (likely(!err)) {
> TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
> trace_tcp_retransmit_skb(sk, skb);
> +   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
> +   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
> + TCP_SKB_CB(skb)->seq, segs);
Any reason to skip failed retransmissions? I would think those are helpful as well.

> } else if (err != -EBUSY) {
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
> }
> --
> 2.9.5
>


Re: [PATCH bpf-next v8 04/12] bpf: Only reply field should be writeable

2018-01-24 Thread Yuchung Cheng
On Tue, Jan 23, 2018 at 11:57 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> Currently, a sock_ops BPF program can write the op field and all the
> reply fields (reply and replylong). This is a bug. The op field should
> not have been writeable and there is currently no way to use replylong
> field for indices >= 1. This patch enforces that only the reply field
> (which equals replylong[0]) is writeable.
Would this patch be more suitable for -net?

>
> Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Yuchung Cheng <ych...@google.com>

> ---
>  net/core/filter.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0cf170f..c356ec0 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
>  {
> if (type == BPF_WRITE) {
> switch (off) {
> -   case offsetof(struct bpf_sock_ops, op) ...
> -offsetof(struct bpf_sock_ops, replylong[3]):
> +   case offsetof(struct bpf_sock_ops, reply):
> break;
> default:
> return false;
> --
> 2.9.5
>


Re: [PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Yuchung Cheng
On Tue, Jan 23, 2018 at 3:30 PM, Lawrence Brakmo <bra...@fb.com> wrote:
>
>
>
> On 1/23/18, 3:26 PM, "Alexei Starovoitov" <alexei.starovoi...@gmail.com> 
> wrote:
>
> On Tue, Jan 23, 2018 at 08:19:54PM +, Lawrence Brakmo wrote:
> > On 1/23/18, 11:50 AM, "Eric Dumazet" <eric.duma...@gmail.com> wrote:
> >
> > On Tue, 2018-01-23 at 14:39 -0500, Neal Cardwell wrote:
> > > On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> 
> wrote:
> > > > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > > >
> > > > The original patch that changes TCP's congestion control 
> via eBPF only
> > > > re-initializes the new congestion control, if the BPF op is 
> set to an
> > > > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently 
> TCP will
> > > >
> > > > What do you mean by “(invalid) value”?
> > > >
> > > > run the new congestion control from random states.
> > > >
> > > > This has always been possible with setsockopt, no?
> > > >
> > > >This patch fixes
> > > > the issue by always re-init the congestion control like 
> other means
> > > > such as setsockopt and sysctl changes.
> > > >
> > > > The current code re-inits the congestion control when calling
> > > > tcp_set_congestion_control except when it is called early on 
> (i.e. op <=
> > > > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to 
> re-initialize
> > > > since it will be initialized later by TCP when the connection 
> is established.
>     > > >
> > > > Otherwise, if we always call tcp_reinit_congestion_control we 
> would call
> > > > tcp_cleanup_congestion_control when the congestion control has 
> not been
> > > > initialized yet.
> > >
> > > On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> 
> wrote:
> > > > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > > >
> > > > The original patch that changes TCP's congestion control 
> via eBPF only
> > > > re-initializes the new congestion control, if the BPF op is 
> set to an
> > > > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently 
> TCP will
> > > >
> > > > What do you mean by “(invalid) value”?
> > > >
> > > > run the new congestion control from random states.
> > > >
> > > > This has always been possible with setsockopt, no?
> > > >
> > > >This patch fixes
> > > > the issue by always re-init the congestion control like 
> other means
> > > > such as setsockopt and sysctl changes.
> > > >
> > > > The current code re-inits the congestion control when calling
> > > > tcp_set_congestion_control except when it is called early on 
> (i.e. op <=
> > > > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to 
> re-initialize
> > > > since it will be initialized later by TCP when the connection 
> is established.
> > > >
> > > > Otherwise, if we always call tcp_reinit_congestion_control we 
> would call
> > > > tcp_cleanup_congestion_control when the congestion control has 
> not been
> > > > initialized yet.
> > >
> > > Interesting. So I wonder if the symptoms we were seeing were due 
> to
> > > kernels that did not yet have this fix:
> > >
> > >   27204aaa9dc6 ("tcp: uniform the set up of sockets after 
> successful
> > > connection):
> > >   
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=27204aaa9dc67b833b77179fdac556288ec3a4bf
> > >
> > > Before that fix, there could be TFO passive connections that at 
> SYN time called:
> > >   tcp_init_congestion_control(child);
> > > and then:
> > >  

Re: [PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Yuchung Cheng
On Tue, Jan 23, 2018 at 12:19 PM, Lawrence Brakmo <bra...@fb.com> wrote:
>
> On 1/23/18, 11:50 AM, "Eric Dumazet" <eric.duma...@gmail.com> wrote:
>
> On Tue, 2018-01-23 at 14:39 -0500, Neal Cardwell wrote:
> > On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> > > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > >
> > > The original patch that changes TCP's congestion control via eBPF 
> only
> > > re-initializes the new congestion control, if the BPF op is set 
> to an
> > > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP 
> will
> > >
> > > What do you mean by “(invalid) value”?
> > >
> > > run the new congestion control from random states.
> > >
> > > This has always been possible with setsockopt, no?
> > >
> > >This patch fixes
> > > the issue by always re-init the congestion control like other 
> means
> > > such as setsockopt and sysctl changes.
> > >
> > > The current code re-inits the congestion control when calling
> > > tcp_set_congestion_control except when it is called early on (i.e. op 
> <=
> > > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to 
> re-initialize
> > > since it will be initialized later by TCP when the connection is 
> established.
> > >
> > > Otherwise, if we always call tcp_reinit_congestion_control we would 
> call
> > > tcp_cleanup_congestion_control when the congestion control has not 
> been
> > > initialized yet.
> >
> > On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> > > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > >
> > > The original patch that changes TCP's congestion control via eBPF 
> only
> > > re-initializes the new congestion control, if the BPF op is set 
> to an
> > > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP 
> will
> > >
> > > What do you mean by “(invalid) value”?
> > >
> > > run the new congestion control from random states.
> > >
> > > This has always been possible with setsockopt, no?
> > >
> > >This patch fixes
> > > the issue by always re-init the congestion control like other 
> means
> > > such as setsockopt and sysctl changes.
> > >
> > > The current code re-inits the congestion control when calling
> > > tcp_set_congestion_control except when it is called early on (i.e. op 
> <=
> > > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to 
> re-initialize
> > > since it will be initialized later by TCP when the connection is 
> established.
> > >
> > > Otherwise, if we always call tcp_reinit_congestion_control we would 
> call
> > > tcp_cleanup_congestion_control when the congestion control has not 
> been
> > > initialized yet.
> >
> > Interesting. So I wonder if the symptoms we were seeing were due to
> > kernels that did not yet have this fix:
> >
> >   27204aaa9dc6 ("tcp: uniform the set up of sockets after successful
> > connection):
> >   
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=27204aaa9dc67b833b77179fdac556288ec3a4bf
> >
> > Before that fix, there could be TFO passive connections that at SYN 
> time called:
> >   tcp_init_congestion_control(child);
> > and then:
> >   tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
> >
> > So that if the CC was switched in the
> > BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB handler then there would be no
> > init for the new module?
>
>
> Note that bpf_sock->op can be written by a malicious BPF filter.
>
> So, a malicious filter can switch from Cubic to BBR without re-init,
> and bad things can happen.
>
> I do not believe we should trust BPF here.
>
> Very good point Eric. One solution would be to make bpf_sock->op not 
> writeable by
> the BPF program.
>
> Neal, you are correct that would have been a problem. I leave it up to you 
> guys whether
> making bpf_sock->op not writeable by BPF program is enough or if it is safer 
> to always
> re-init (as long as there is no problem calling 
> tcp_cleanup_congestion_control when it
> has not been initialized).
Thank you Larry for the clarification. I prefer the latter approach
and will respin.

>
>
>


[PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Yuchung Cheng
The original patch that changes TCP's congestion control via eBPF only
re-initializes the new congestion control if the BPF op is set to an
(invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP will
run the new congestion control from random states. This patch fixes
the issue by always re-initializing the congestion control, as other
means such as setsockopt and sysctl changes already do.

Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
---
 include/net/tcp.h   |  2 +-
 net/core/filter.c   |  3 +--
 net/ipv4/tcp.c  |  2 +-
 net/ipv4/tcp_cong.c | 11 ++-
 4 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d2f022..f94a71b62ba5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1006,7 +1006,7 @@ void tcp_get_default_congestion_control(struct net *net, char *name);
 void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
-int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit);
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load);
 u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 6a85e67fafce..757d52adccfc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3233,12 +3233,11 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
char name[TCP_CA_NAME_MAX];
-   bool reinit = bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN;
 
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, reinit);
+   ret = tcp_set_congestion_control(sk, name, false);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f08eebe60446..21e2a07e857e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2550,7 +2550,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
name[val] = 0;
 
lock_sock(sk);
-   err = tcp_set_congestion_control(sk, name, true, true);
+   err = tcp_set_congestion_control(sk, name, true);
release_sock(sk);
return err;
}
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index bc6c02f16243..70895bee3026 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -332,7 +332,7 @@ int tcp_set_allowed_congestion_control(char *val)
  * tcp_reinit_congestion_control (if the current congestion control was
  * already initialized.
  */
-int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit)
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcp_congestion_ops *ca;
@@ -356,15 +356,8 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, boo
if (!ca) {
err = -ENOENT;
} else if (!load) {
-   const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
-
if (try_module_get(ca->owner)) {
-   if (reinit) {
-   tcp_reinit_congestion_control(sk, ca);
-   } else {
-   icsk->icsk_ca_ops = ca;
-   module_put(old_ca->owner);
-   }
+   tcp_reinit_congestion_control(sk, ca);
} else {
err = -EBUSY;
}
-- 
2.16.0.rc1.238.g530d649a79-goog
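
For context, a minimal sock_ops sketch of the path this patch changes; it is
not part of the patch, the layout follows common libbpf conventions, and the
fallback #defines are only for headers that lack SOL_TCP/TCP_CONGESTION:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef SOL_TCP
#define SOL_TCP 6		/* = IPPROTO_TCP */
#endif
#ifndef TCP_CONGESTION
#define TCP_CONGESTION 13	/* value from include/uapi/linux/tcp.h */
#endif

SEC("sockops")
int set_bbr(struct bpf_sock_ops *skops)
{
	char cc[] = "bbr";

	/* Switch only once the connection is established, i.e. after the
	 * original congestion control has been initialized; this is the
	 * case where the kernel now always re-inits the new module.
	 */
	if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
	    skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
		bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION, cc, sizeof(cc));
	return 1;
}

char _license[] SEC("license") = "GPL";

Attached to a cgroup as BPF_CGROUP_SOCK_OPS, the switch above behaves like
setsockopt(TCP_CONGESTION) after this fix: the previous module is cleaned up
and the new one is initialized from a known state.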



Re: [PATCH net-next] tcp: avoid negotitating ECN for BBR

2018-01-22 Thread Yuchung Cheng
On Fri, Jan 19, 2018 at 11:31 AM, David Miller <da...@davemloft.net> wrote:
>
> From: Yuchung Cheng <ych...@google.com>
> Date: Tue, 16 Jan 2018 17:57:26 -0800
>
> > This patch keeps BBR from negotiating ECN if sysctl ECN is
> > set. Prior to this patch, BBR negotiates ECN if enabled, sends
> > CWR upon receiving ECE ACKs but does not react to them. This can
> > cause confusion from the protocol perspective. Therefore this
> > patch prevents the connection from negotiating ECN if BBR is the
> > congestion control during the handshake.
> >
> > Note that after the handshake, the user can still switch to a
> > different congestion control that supports or even requires ECN
> > (e.g. DCTCP).  In that case, the connection can not re-negotiate
> > ECN and has to go with the ECN-free mode in that congestion control.
> >
> > There are other cases BBR would still respond to ECE ACKs with CWR
> > but does not react to it like the behavior before this patch. First,
> > when the user switches to BBR congestion control but the connection
> > has already negotiated ECN before. Second, the system has configured
> > the ip route and/or uses eBPF to enable ECN on the connection that
> > uses BBR congestion control.
> >
> > Signed-off-by: Yuchung Cheng <ych...@google.com>
> > Signed-off-by: Neal Cardwell <ncardw...@google.com>
> > Acked-by: Yousuk Seung <ysse...@google.com>
> > Acked-by: Eric Dumazet <eduma...@google.com>
>
> Well, this is a bit disappointing.  I'm having trouble justifying
> applying this.
>
> Why doesn't BBR react to ECN notifications?  Is it because BBR's
> idea of congestion differs from the one ECN is likely indicating?
>
> This is really unfortunate, because if BBR does become quite prominent
> (that's what you want right? :-) then what little success there has
> been deploying working ECN will be for almost nothing, and there
> will be little incentive for further ECN deployment.
>
> And the weird behavior you list in your last paragraph, about how if
> the user switches to BBR then ECN will be active, is just a red flag
> that shows perhaps this is a bad idea overall.
>
> ECN behavior should not be so tightly bound to the congestion control
> algorithm like this, it's a connection property independant of
> congestion control algorithm.
>
> I'm not applying this for now, sorry.  Maybe if you significantly
> enhance the commit message and try to do something sane with the
> algorithm switching case it is worth a respin.
Thank you for your feedback. We have not yet found a good approach to
react to the coarse, loss-like congestion signal of standard ECN
(RFC 3168). The newer Accurate ECN proposal may provide better and
compatible signals to BBR, which we are still exploring at an early
stage.

Your feedback on the weird behavior makes sense. We'll respin the
patch to (hopefully) address that instead of bluntly not negotiating
ECN.

>
> Thanks.
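
As an aside, whether a given connection actually negotiated ECN (and whether
CE marks were ever seen) can be observed from user space. A small sketch,
assuming the build host's <linux/tcp.h> exports the TCPI_OPT_* flags and the
full struct tcp_info:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* struct tcp_info, TCP_INFO, TCPI_OPT_* */

static void print_ecn_state(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
		perror("getsockopt(TCP_INFO)");
		return;
	}
	printf("ECN negotiated: %s, CE seen: %s\n",
	       (ti.tcpi_options & TCPI_OPT_ECN) ? "yes" : "no",
	       (ti.tcpi_options & TCPI_OPT_ECN_SEEN) ? "yes" : "no");
}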


[PATCH 1/2 net-next] tcp: avoid min-RTT overestimation from delayed ACKs

2018-01-17 Thread Yuchung Cheng
This patch avoids having the TCP sender or congestion control
overestimate the min RTT by orders of magnitude. This happens when
all the samples in the windowed filter come from one-packet transfers
such as small requests and health-check chit-chat, which is fairly
common for applications using persistent connections. This patch
conservatively labels and skips RTT samples obtained from this type
of workload.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ff71b18d9682..2c6797134553 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -97,6 +97,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
 #define FLAG_UPDATE_TS_RECENT  0x4000 /* tcp_replace_ts_recent() */
#define FLAG_NO_CHALLENGE_ACK  0x8000 /* do not call tcp_send_challenge_ack() */
+#define FLAG_ACK_MAYBE_DELAYED 0x1 /* Likely a delayed ACK */
 
 #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP   (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -2857,11 +2858,18 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
*rexmit = REXMIT_LOST;
 }
 
-static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
+static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us, const int flag)
 {
u32 wlen = sock_net(sk)->ipv4.sysctl_tcp_min_rtt_wlen * HZ;
struct tcp_sock *tp = tcp_sk(sk);
 
+   if ((flag & FLAG_ACK_MAYBE_DELAYED) && rtt_us > tcp_min_rtt(tp)) {
+   /* If the remote keeps returning delayed ACKs, eventually
+* the min filter would pick it up and overestimate the
+* prop. delay when it expires. Skip suspected delayed ACKs.
+*/
+   return;
+   }
minmax_running_min(&tp->rtt_min, wlen, tcp_jiffies32,
   rtt_us ? : jiffies_to_usecs(1));
 }
@@ -2901,7 +2909,7 @@ static bool tcp_ack_update_rtt(struct sock *sk, const int flag,
 * always taken together with ACK, SACK, or TS-opts. Any negative
 * values will be skipped with the seq_rtt_us < 0 check above.
 */
-   tcp_update_rtt_min(sk, ca_rtt_us);
+   tcp_update_rtt_min(sk, ca_rtt_us, flag);
tcp_rtt_estimator(sk, seq_rtt_us);
tcp_set_rto(sk);
 
@@ -3125,6 +3133,17 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
if (likely(first_ackt) && !(flag & FLAG_RETRANS_DATA_ACKED)) {
seq_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, first_ackt);
ca_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, last_ackt);
+
+   if (pkts_acked == 1 && last_in_flight < tp->mss_cache &&
+   last_in_flight && !prior_sacked && fully_acked &&
+   sack->rate->prior_delivered + 1 == tp->delivered &&
+   !(flag & (FLAG_CA_ALERT | FLAG_SYN_ACKED))) {
+   /* Conservatively mark a delayed ACK. It's typically
+* from a lone runt packet over the round trip to
+* a receiver w/o out-of-order or CE events.
+*/
+   flag |= FLAG_ACK_MAYBE_DELAYED;
+   }
}
if (sack->first_sackt) {
sack_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, 
sack->first_sackt);
-- 
2.16.0.rc1.238.g530d649a79-goog
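
For reference, the min RTT estimate this patch protects can be observed from
user space; a minimal sketch, assuming a kernel and <linux/tcp.h> new enough
to export tcpi_min_rtt (in microseconds) via TCP_INFO:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* struct tcp_info with tcpi_min_rtt, TCP_INFO */

static void print_min_rtt(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("tcp min rtt: %u us\n", ti.tcpi_min_rtt);
	else
		perror("getsockopt(TCP_INFO)");
}

On a long-lived request/response connection, this value could previously end
up inflated by the receiver's delayed-ACK timer; with the patch, suspected
delayed-ACK samples no longer enter the min filter.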



[PATCH 0/2 net-next] tcp: do not use RTT from delayed ACKs for min-RTT

2018-01-17 Thread Yuchung Cheng
This patch set prevents TCP sender from using RTT samples from
(suspected) delayed ACKs as the minimum RTT, to avoid unbounded
over-estimation of the network path delay. This issue is common
when a connection has extended periods of one packet chit-chat
beyond the min RTT filter window. The first patch does that for TCP
general min RTT estimation. The second patch addresses specifically
the BBR congestion control's min RTT filter.

Yuchung Cheng (2):
  tcp: avoid min-RTT overestimation from delayed ACKs
  tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBR

 include/net/tcp.h|  1 +
 net/ipv4/tcp_bbr.c   |  3 ++-
 net/ipv4/tcp_input.c | 24 ++--
 3 files changed, 25 insertions(+), 3 deletions(-)

-- 
2.16.0.rc1.238.g530d649a79-goog



[PATCH 2/2 net-next] tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBR

2018-01-17 Thread Yuchung Cheng
A persistent connection may send a tiny amount of data (e.g. health-check)
for a long period of time. BBR's windowed min RTT filter may then only see
RTT samples from delayed ACKs, causing BBR to grossly over-estimate
the path delay depending on how much the ACK was delayed at the receiver.

This patch skips RTT samples that are likely coming from delayed ACKs. Note
that it is possible the sender never obtains a valid measure to set the
min RTT. In this case BBR will continue to set cwnd to the initial
window, which seems fine because the connection is a thin stream.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
Acked-by: Priyaranjan Jha <priyar...@google.com>
---
 include/net/tcp.h| 1 +
 net/ipv4/tcp_bbr.c   | 3 ++-
 net/ipv4/tcp_input.c | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69d3c37..5a1d26a18599 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -953,6 +953,7 @@ struct rate_sample {
u32  prior_in_flight;   /* in flight before this ACK */
bool is_app_limited;/* is sample from packet with bubble in pipe? */
bool is_retrans;/* is sample from retransmission? */
+   bool is_ack_delayed;/* is this (likely) a delayed ACK? */
 };
 
 struct tcp_congestion_ops {
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 8322f26e770e..785712be5b0d 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -766,7 +766,8 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
filter_expired = after(tcp_jiffies32,
   bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
if (rs->rtt_us >= 0 &&
-   (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
+   (rs->rtt_us <= bbr->min_rtt_us ||
+(filter_expired && !rs->is_ack_delayed))) {
bbr->min_rtt_us = rs->rtt_us;
bbr->min_rtt_stamp = tcp_jiffies32;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c6797134553..cfa51cfd2d99 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3633,6 +3633,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
delivered = tp->delivered - delivered;  /* freshly ACKed or SACKed */
lost = tp->lost - lost; /* freshly marked lost */
+   rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);
tcp_xmit_recovery(sk, rexmit);
-- 
2.16.0.rc1.238.g530d649a79-goog
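
Conceptually, the min_rtt filter touched above is a windowed minimum: keep the
smallest recent sample, and accept any sample once the window expires, now
additionally requiring that the sample is not from a suspected delayed ACK.
A stand-alone sketch of that rule, with made-up names:

struct windowed_min {
	unsigned int val;	/* current minimum within the window */
	unsigned long stamp;	/* time the minimum was recorded */
};

/* Mirrors the update rule in bbr_update_min_rtt(): take a new sample if it
 * is no larger than the current minimum, or if the window of `win` ticks
 * has expired and the sample is not flagged as a delayed ACK.
 */
static void windowed_min_update(struct windowed_min *m, unsigned int sample,
				unsigned long now, unsigned long win,
				int ack_delayed)
{
	int expired = (long)(now - (m->stamp + win)) > 0;

	if (sample <= m->val || (expired && !ack_delayed)) {
		m->val = sample;
		m->stamp = now;
	}
}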



[PATCH net-next] tcp: avoid negotitating ECN for BBR

2018-01-16 Thread Yuchung Cheng
This patch keeps BBR from negotiating ECN if sysctl ECN is
set. Prior to this patch, BBR negotiates ECN if enabled, sends
CWR upon receiving ECE ACKs but does not react to them. This can
cause confusion from the protocol perspective. Therefore this
patch prevents the connection from negotiating ECN if BBR is the
congestion control during the handshake.

Note that after the handshake, the user can still switch to a
different congestion control that supports or even requires ECN
(e.g. DCTCP).  In that case, the connection can not re-negotiate
ECN and has to go with the ECN-free mode in that congestion control.

There are other cases where BBR would still respond to ECE ACKs with
CWR but not react to them, as before this patch. First,
when the user switches to BBR congestion control but the connection
has already negotiated ECN before. Second, the system has configured
the ip route and/or uses eBPF to enable ECN on the connection that
uses BBR congestion control.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Yousuk Seung <ysse...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 include/net/tcp.h | 7 +++
 net/ipv4/tcp_bbr.c| 2 +-
 net/ipv4/tcp_input.c  | 3 ++-
 net/ipv4/tcp_output.c | 6 --
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69d3c37..22345132d969 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -925,6 +925,8 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
 #define TCP_CONG_NEEDS_ECN 0x2
+/* Does not use or react to ECN */
+#define TCP_CONG_DONT_USE_ECN  0x4
 
 union tcp_cc_info;
 
@@ -1033,6 +1035,11 @@ static inline bool tcp_ca_needs_ecn(const struct sock 
*sk)
return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ECN;
 }
 
+static inline bool tcp_ca_uses_ecn(const struct sock *sk)
+{
+   return !(inet_csk(sk)->icsk_ca_ops->flags & TCP_CONG_DONT_USE_ECN);
+}
+
 static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 8322f26e770e..27456554b113 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -926,7 +926,7 @@ static void bbr_set_state(struct sock *sk, u8 new_state)
 }
 
 static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
-   .flags  = TCP_CONG_NON_RESTRICTED,
+   .flags  = TCP_CONG_NON_RESTRICTED | TCP_CONG_DONT_USE_ECN,
.name   = "bbr",
.owner  = THIS_MODULE,
.init   = bbr_init,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ff71b18d9682..6731d0b9b146 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6090,7 +6090,8 @@ static void tcp_ecn_create_request(struct request_sock *req,
 
ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK);
-   ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
+   ecn_ok = ecn_ok_dst ||
+(net->ipv4.sysctl_tcp_ecn && tcp_ca_uses_ecn(listen_sk));
 
if ((!ect && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
(ecn_ok_dst & DST_FEATURE_ECN_CA) ||
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95461f02ac9a..446cb65090f5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -312,8 +312,10 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
-   bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
-   tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
+   bool use_ecn = tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
+
+   if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 && tcp_ca_uses_ecn(sk))
+   use_ecn = true;
 
if (!use_ecn) {
const struct dst_entry *dst = __sk_dst_get(sk);
-- 
2.16.0.rc1.238.g530d649a79-goog
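
To illustrate the "user can still switch" case above, a minimal user-space
sketch that moves an established socket to an ECN-requiring congestion
control such as DCTCP via TCP_CONGESTION; the connection keeps whatever ECN
state it negotiated at SYN time:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>	/* TCP_CONGESTION */

/* Switch the congestion control of a connected socket, e.g. to "dctcp".
 * The module must be available and, unless listed in
 * net.ipv4.tcp_allowed_congestion_control, requires CAP_NET_ADMIN.
 */
static int switch_cc(int fd, const char *name)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
		       name, strlen(name)) < 0) {
		perror("setsockopt(TCP_CONGESTION)");
		return -1;
	}
	return 0;
}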



Re: [PATCH v2 bpf-next 06/11] bpf: Add sock_ops RTO callback

2017-12-28 Thread Yuchung Cheng
On Thu, Dec 21, 2017 at 5:20 PM, Lawrence Brakmo  wrote:
>
> Adds an optional call to sock_ops BPF program based on whether the
> BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
> The BPF program is passed 2 arguments: icsk_retransmits and whether the
> RTO has expired.
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  include/uapi/linux/bpf.h | 5 +
>  include/uapi/linux/tcp.h | 3 +++
>  net/ipv4/tcp_timer.c | 9 +
>  3 files changed, 17 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 62b2c89..3cf9014 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -995,6 +995,11 @@ enum {
>  * a congestion threshold. RTTs above
>  * this indicate congestion
>  */
> +   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
> +* Arg1: value of icsk_retransmits
> +* Arg2: value of icsk_rto
> +* Arg3: whether RTO has expired
> +*/
>  };
>
>  #define TCP_BPF_IW 1001 /* Set TCP initial congestion window */
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index b4a4f64..089c19e 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -259,6 +259,9 @@ struct tcp_md5sig {
> __u8tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */
>  };
>
> +/* Definitions for bpf_sock_ops_flags */
> +#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
> +
>  /* INET_DIAG_MD5SIG */
>  struct tcp_diag_md5sig {
> __u8tcpm_family;
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 6db3124..f9c57e2 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -215,9 +215,18 @@ static int tcp_write_timeout(struct sock *sk)
> tcp_fastopen_active_detect_blackhole(sk, expired);
can't we just call it here once w/ 'expired' as a parameter, instead
of duplicating the code?

> if (expired) {
> /* Has it gone just too far? */
> +   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
> +   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
> + icsk->icsk_retransmits,
> + icsk->icsk_rto, 1);
> tcp_write_err(sk);
> return 1;
> }
> +
> +   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
> +   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
> + icsk->icsk_retransmits,
> + icsk->icsk_rto, 0);
> return 0;
>  }
>
> --
> 2.9.5
>
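
A sketch of the consolidation suggested above, using the helpers introduced by
this series: invoke the callback once and pass 'expired' as the third
argument, so the tail of tcp_write_timeout() would look roughly like:

	if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
		tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
				  icsk->icsk_retransmits,
				  icsk->icsk_rto, !!expired);
	if (expired) {
		/* Has it gone just too far? */
		tcp_write_err(sk);
		return 1;
	}
	return 0;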


[PATH net-next] tcp: pause Fast Open globally after third consecutive timeout

2017-12-12 Thread Yuchung Cheng
Prior to this patch, active Fast Open is paused on a specific
destination IP address if the previous connections to the
IP address have experienced recurring timeouts. But recent
experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
browsers indicate the issue is often caused by broken middle-boxes
sitting close to the client. Therefore it is a much better user
experience if Fast Open is disabled outright globally to avoid
experiencing further timeouts on connections toward other
destinations.

This patch changes the destination-IP disablement to global
disablement if a connection experiences recurring timeouts
or aborts due to timeout.  Repeated incidents would still
exponentially increase the pause time, starting from an hour.
This is extremely conservative but an unfortunate compromise to
minimize bad experience due to broken middle-boxes.

Reported-by: Dragana Damjanovic <ddamjano...@mozilla.com>
Reported-by: Patrick McManus <mcma...@ducksong.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Wei Wang <wei...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 Documentation/networking/ip-sysctl.txt |  1 +
 include/net/tcp.h  |  5 ++---
 net/ipv4/tcp_fastopen.c| 30 --
 net/ipv4/tcp_metrics.c |  5 +
 net/ipv4/tcp_timer.c   | 17 +
 5 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 46c7e1085efc..3f2c40d8e6aa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -606,6 +606,7 @@ tcp_fastopen_blackhole_timeout_sec - INTEGER
This time period will grow exponentially when more blackhole issues
get detected right after Fastopen is re-enabled and will reset to
initial value when the blackhole issue goes away.
+   0 to disable the blackhole detection.
By default, it is set to 1hr.
 
 tcp_syn_retries - INTEGER
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3c3744e52cd1..6939e69d3c37 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1507,8 +1507,7 @@ int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
 
 /* From tcp_fastopen.c */
 void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
-   struct tcp_fastopen_cookie *cookie, int *syn_loss,
-   unsigned long *last_syn_loss);
+   struct tcp_fastopen_cookie *cookie);
 void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
struct tcp_fastopen_cookie *cookie, bool syn_lost,
u16 try_exp);
@@ -1546,7 +1545,7 @@ extern unsigned int sysctl_tcp_fastopen_blackhole_timeout;
 void tcp_fastopen_active_disable(struct sock *sk);
 bool tcp_fastopen_active_should_disable(struct sock *sk);
 void tcp_fastopen_active_disable_ofo_check(struct sock *sk);
-void tcp_fastopen_active_timeout_reset(void);
+void tcp_fastopen_active_detect_blackhole(struct sock *sk, bool expired);
 
 /* Latencies incurred by various limits for a sender. They are
  * chronograph-like stats that are mutually exclusive.
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 78c192ee03a4..018a48477355 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -379,18 +379,9 @@ struct sock *tcp_try_fastopen(struct sock *sk, struct sk_buff *skb,
 bool tcp_fastopen_cookie_check(struct sock *sk, u16 *mss,
   struct tcp_fastopen_cookie *cookie)
 {
-   unsigned long last_syn_loss = 0;
const struct dst_entry *dst;
-   int syn_loss = 0;
 
-   tcp_fastopen_cache_get(sk, mss, cookie, &syn_loss, &last_syn_loss);
-
-   /* Recurring FO SYN losses: no cookie or data in SYN */
-   if (syn_loss > 1 &&
-   time_before(jiffies, last_syn_loss + (60*HZ << syn_loss))) {
-   cookie->len = -1;
-   return false;
-   }
+   tcp_fastopen_cache_get(sk, mss, cookie);
 
/* Firewall blackhole issue check */
if (tcp_fastopen_active_should_disable(sk)) {
@@ -448,6 +439,8 @@ EXPORT_SYMBOL(tcp_fastopen_defer_connect);
  * following circumstances:
  *   1. client side TFO socket receives out of order FIN
  *   2. client side TFO socket receives out of order RST
+ *   3. client side TFO socket has timed out three times consecutively during
+ *  or after handshake
  * We disable active side TFO globally for 1hr at first. Then if it
  * happens again, we disable it for 2h, then 4h, 8h, ...
  * And we reset the timeout back to 1hr when we see a successful active
@@ -524,3 +517,20 @@ void tcp_fastopen_active_disable_ofo_check(struct sock *sk)
dst_release(dst);
}
 }
+
+void tcp_fastopen_active_detect_blackhole(struct sock 
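
For background, the "active Fast Open" that this patch can pause globally is
the client-side path; a minimal client sketch, assuming a kernel that supports
TCP_FASTOPEN_CONNECT (defined below in case the libc headers lack it):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30	/* value from include/uapi/linux/tcp.h */
#endif

/* Connect with client-side (active) TCP Fast Open and send a first request.
 * connect() returns without waiting for the handshake; the first write()
 * sends the SYN, carrying data when a valid TFO cookie is cached.
 */
static int tfo_connect_send(const struct sockaddr_in *dst,
			    const char *req, size_t len)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT, &one, sizeof(one));
	if (connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0 ||
	    write(fd, req, len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

When the blackhole detection above pauses Fast Open, such connects silently
fall back to a regular three-way handshake until the pause period expires.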

[PATCH net-next] tcp: allow TLP in ECN CWR

2017-12-11 Thread Yuchung Cheng
From: Neal Cardwell <ncardw...@google.com>

This patch enables tail loss probe in cwnd reduction (CWR) state
to detect potential losses. Prior to this patch, since the sender
uses PRR to determine the cwnd in CWR state, the combination of
CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls
upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer()
defers sending wait for more ACKs. The ACKs may not come due to
packet losses.

Disallowing TLP when there is unused cwnd had the primary effect
of disallowing TLP when there is TSO deferral, Nagle deferral,
or we hit the rwin limit. Because basically every application
write() or incoming ACK will cause us to run tcp_write_xmit()
to see if we can send more, and then if we sent something we call
tcp_schedule_loss_probe() to see if we should schedule a TLP. At
that point, there are a few common reasons why some cwnd budget
could still be unused:

(a) rwin limit
(b) nagle check
(c) TSO deferral
(d) TSQ

For (d), after the next packet tx completion the TSQ mechanism
will allow us to send more packets, so we don't really need a
TLP (in practice it shouldn't matter whether we schedule one
or not). But for (a), (b), (c) the sender won't send any more
packets until it gets another ACK. But if the whole flight was
lost, or all the ACKs were lost, then we won't get any more ACKs,
and ideally we should schedule and send a TLP to get more feedback.
In particular for a long time we have wanted some kind of timer for
TSO deferral, and at least this would give us some kind of timer.

Reported-by: Steve Ibanez <siba...@stanford.edu>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Nandita Dukkipati <nandi...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_output.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a4d214c7b506..04be9f833927 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2414,15 +2414,12 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 
early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
/* Schedule a loss probe in 2*RTT for SACK capable connections
-* in Open state, that are either limited by cwnd or application.
+* not in loss recovery, that are either limited by cwnd or application.
 */
if ((early_retrans != 3 && early_retrans != 4) ||
!tp->packets_out || !tcp_is_sack(tp) ||
-   icsk->icsk_ca_state != TCP_CA_Open)
-   return false;
-
-   if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
-!tcp_write_queue_empty(sk))
+   (icsk->icsk_ca_state != TCP_CA_Open &&
+icsk->icsk_ca_state != TCP_CA_CWR))
return false;
 
/* Probe timeout is 2*rtt. Add minimum RTO to account
-- 
2.15.1.424.g9478a66081-goog



[PATH net 1/4] tcp: correctly test congestion state in RACK

2017-12-07 Thread Yuchung Cheng
RACK does not test the loss recovery state correctly to compute
the reordering window. It assumes if lost_out is zero then TCP is
not in loss recovery. But it can be zero during recovery before
calling tcp_rack_detect_loss(): when an ACK acknowledges all
packets marked lost before receiving this ACK, but TCP has not yet
discovered new ones via tcp_rack_detect_loss(). The fix is to
simply test the congestion state directly.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_recovery.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index d3ea89020c69..3143664902e9 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -55,7 +55,8 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 * to queuing or delayed ACKs.
 */
reo_wnd = 1000;
-   if ((tp->rack.reord || !tp->lost_out) && min_rtt != ~0U) {
+   if ((tp->rack.reord || inet_csk(sk)->icsk_ca_state < TCP_CA_Recovery) &&
+   min_rtt != ~0U) {
reo_wnd = max((min_rtt >> 2) * tp->rack.reo_wnd_steps, reo_wnd);
reo_wnd = min(reo_wnd, tp->srtt_us >> 3);
}
-- 
2.15.1.424.g9478a66081-goog



[PATH net 3/4] tcp: fix off-by-one bug in RACK

2017-12-07 Thread Yuchung Cheng
RACK should mark a packet lost when remaining wait time is zero.

Signed-off-by: Yuchung Cheng <ych...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Priyaranjan Jha <priyar...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_recovery.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 3143664902e9..0c182303e62e 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -80,12 +80,12 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 */
remaining = tp->rack.rtt_us + reo_wnd -
tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
-   if (remaining < 0) {
+   if (remaining <= 0) {
tcp_rack_mark_skb_lost(sk, skb);
list_del_init(&skb->tcp_tsorted_anchor);
} else {
-   /* Record maximum wait time (+1 to avoid 0) */
-   *reo_timeout = max_t(u32, *reo_timeout, 1 + remaining);
+   /* Record maximum wait time */
+   *reo_timeout = max_t(u32, *reo_timeout, remaining);
}
}
 }
-- 
2.15.1.424.g9478a66081-goog
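
A worked example of the boundary this patch changes, with the computation
pulled out as a tiny helper (made-up name):

/* Remaining wait time before RACK may mark a packet lost: the RACK RTT plus
 * the reordering window, minus the time elapsed since the packet was sent.
 */
static long rack_remaining_us(long rack_rtt_us, long reo_wnd_us,
			      long elapsed_us)
{
	return rack_rtt_us + reo_wnd_us - elapsed_us;
}

/* e.g. rack_remaining_us(25000, 1000, 26000) == 0: the reordering window has
 * fully elapsed, so with the "<= 0" check the packet is marked lost now
 * instead of arming a zero-length reordering timer.
 */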


