Re: [PATCH net] tcp: fix keepalive when data remain undelivered

2021-02-19 Thread Yuchung Cheng
On Fri, Feb 19, 2021 at 4:54 PM Enke Chen  wrote:
>
> From: Enke Chen 
>
> TCP keepalive does not timeout under the condition that network connection
> is lost and data remain undelivered (incl. retransmit). A very simple
> scenarios of the failure is to write data to a tcp socket after the network
> connection is lost.

AFAIK the current Linux TCP keepalive implementation does not violate
RFC793bis (Section 3.7.4, Keep-Alives):
https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-20#section-3.7.4

We have even raised that in IETF tcpm list to get more clarity
https://mailarchive.ietf.org/arch/msg/tcpm/KxcEsLtDuDNhcP8UjbyPkJqpzsE/

I believe you interpret the keep-alive differently -- so this is
arguably a subjective "bug fix". As Neal and I have expressed in our
private communications, the current Linux keepalive behavior has been
in place for more than a decade. Changing it may break many existing
applications even if it benefits some.

>
> Under the specified condition the keepalive timeout is not evaluated in
> the keepalive timer. That is the primary cause of the failure. In addition,
> the keepalive probe is not sent out in the keepalive timer. Although packet
> retransmit or 0-window probe can serve a similar purpose, they have their
> own timers and backoffs that are generally not aligned with the keepalive
> parameters for probes and timeout.
>
> As the timing and conditions of the events involved are random, the tcp
> keepalive can fail randomly. Given the randomness of the failures, fixing
> the issue would not cause any backward compatibility issues. As was well
> said, "Determinism is a special case of randomness".
>
> The fix in this patch consists of the following:
>
>   a. Always evaluate the keepalive timeout in the keepalive timer.
>
>   b. Always send out the keepalive probe in the keepalive timer (after the
>  keepalive idle time). Given that the keepalive intervals are usually
>  in the range of 30 - 60 seconds, there is no need for an optimization
>  to further reduce the number of keepalive probes in the presence of
>  packet retransmit.
>
>   c. Use the elapsed time (instead of the 0-window probe counter) in
>  evaluating tcp keepalive timeout.
>
> Thanks to Eric Dumazet, Neal Cardwell, and Yuchung Cheng for helpful
> discussions about the issue and options for fixing it.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2 Initial git repository build")
> Signed-off-by: Enke Chen 
> ---
>  net/ipv4/tcp_timer.c | 20 ++--
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 4ef08079ccfa..16a044da20db 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -708,29 +708,23 @@ static void tcp_keepalive_timer (struct timer_list *t)
> ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
> goto out;
>
> -   elapsed = keepalive_time_when(tp);
> -
> -   /* It is alive without keepalive 8) */
> -   if (tp->packets_out || !tcp_write_queue_empty(sk))
> -   goto resched;
> -
> elapsed = keepalive_time_elapsed(tp);
>
> if (elapsed >= keepalive_time_when(tp)) {
> /* If the TCP_USER_TIMEOUT option is enabled, use that
>  * to determine when to timeout instead.
>  */
> -   if ((icsk->icsk_user_timeout != 0 &&
> -   elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
> -   icsk->icsk_probes_out > 0) ||
> -   (icsk->icsk_user_timeout == 0 &&
> -   icsk->icsk_probes_out >= keepalive_probes(tp))) {
> +   u32 timeout = icsk->icsk_user_timeout ?
> + msecs_to_jiffies(icsk->icsk_user_timeout) :
> + keepalive_intvl_when(tp) * keepalive_probes(tp) +
> + keepalive_time_when(tp);
> +
> +   if (elapsed >= timeout) {
> tcp_send_active_reset(sk, GFP_ATOMIC);
> tcp_write_err(sk);
> goto out;
> }
> if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
> -   icsk->icsk_probes_out++;
> elapsed = keepalive_intvl_when(tp);
> } else {
> /* If keepalive was lost due to local congestion,
> @@ -744,8 +738,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
> }
>
> sk_mem_reclaim(sk);
> -
> -resched:
> inet_csk_reset_keepalive_timer (sk, elapsed);
> goto out;
>
> --
> 2.29.2
>


Re: [PATCH] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-13 Thread Yuchung Cheng
On Wed, Jan 13, 2021 at 12:49 PM Eric Dumazet  wrote:
>
> On Wed, Jan 13, 2021 at 9:12 PM Enke Chen  wrote:
> >
> > From: Enke Chen 
> >
> > The TCP session does not terminate with TCP_USER_TIMEOUT when data
> > remain untransmitted due to zero window.
> >
> > The number of unanswered zero-window probes (tcp_probes_out) is
> > reset to zero with incoming acks irrespective of the window size,
> > as described in tcp_probe_timer():
> >
> > RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > as long as the receiver continues to respond probes. We support
> > this by default and reset icsk_probes_out with incoming ACKs.
> >
> > This counter, however, is the wrong one to be used in calculating the
> > duration that the window remains closed and data remain untransmitted.
> > Thanks to Jonathan Maxwell  for diagnosing the
> > actual issue.
> >
> > In this patch a separate counter is introduced to track the number of
> > zero-window probes that are not answered with any non-zero window ack.
> > This new counter is used in determining when to abort the session with
> > TCP_USER_TIMEOUT.
> >
>
> I think one possible issue would be that local congestion (full qdisc)
> would abort early,
> because tcp_model_timeout() assumes linear backoff.
Yes, exactly. If ZWPs are dropped due to local congestion,
tcp_model_timeout() computes the elapsed time incorrectly. Therefore
having a starting timestamp is the surest way, because it does not
assume any specific backoff behavior.
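A rough sketch of the discrepancy Eric's packetdrill comment describes (this is illustrative arithmetic, not kernel code; the constants mirror the test above: a 3 s TCP_USER_TIMEOUT, a 300 ms base RTO, and 500 ms retries while the qdisc drops everything):

```python
# When zero-window probes are dropped by a congested local qdisc, the
# stack retries on a fixed ~500 ms cadence.  But estimating elapsed time
# from the probe count via exponential RTO backoff (as tcp_model_timeout()
# does) infers a much longer wait than actually occurred.

USER_TIMEOUT_MS = 3000     # TCP_USER_TIMEOUT from the packetdrill test
RTO_MS = 300               # base RTO assumed by the backoff model
CONGESTED_RETRY_MS = 500   # actual retry period under local congestion

def modeled_elapsed_ms(probes, rto=RTO_MS):
    """Elapsed time the exponential-backoff model infers from a probe count."""
    return sum(rto * (1 << i) for i in range(probes))

for probes in range(1, 7):
    actual = probes * CONGESTED_RETRY_MS
    modeled = modeled_elapsed_ms(probes)
    print(f"probes={probes}  actual={actual}ms  modeled={modeled}ms")

# After 3 probes the model has accumulated 300+600+1200 = 2100 ms and the
# 4th step would put it past the 3 s budget, while only ~1.5-2 s of wall
# clock has elapsed -- hence the early abort with tcpi_probes == 3 in the
# test.  A stored start timestamp sidesteps the estimation entirely.
```

With a real timestamp the abort decision no longer depends on how the probes were paced, which is why the timestamp approach is the surest fix.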

>
> Neal or Yuchung can further comment on that, it is late for me in France.
>
> packetdrill test would be :
>
>0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
>
>
>+0 < S 0:0(0) win 0 
>+0 > S. 0:0(0) ack 1 
>
>   +.1 < . 1:1(0) ack 1 win 65530
>+0 accept(3, ..., ...) = 4
>
>+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
>+0 write(4, ..., 24) = 24
>+0 > P. 1:25(24) ack 1
>+.1 < . 1:1(0) ack 25 win 65530
>+0 %{ assert tcpi_probes == 0, tcpi_probes; \
>  assert tcpi_backoff == 0, tcpi_backoff }%
>
> // install a qdisc dropping all packets
>+0 `tc qdisc delete dev tun0 root 2>/dev/null ; tc qdisc add dev
> tun0 root pfifo limit 0`
>+0 write(4, ..., 24) = 24
>// When qdisc is congested we retry every 500ms therefore in theory
>// we'd retry 6 times before hitting 3s timeout. However, since we
>// estimate the elapsed time based on exp backoff of actual RTO (300ms),
>// we'd bail earlier with only 3 probes.
>+2.1 write(4, ..., 24) = -1
>+0 %{ assert tcpi_probes == 3, tcpi_probes; \
>  assert tcpi_backoff == 0, tcpi_backoff }%
>+0 close(4) = 0
>
> > Cc: sta...@vger.kernel.org
> > Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> > Reported-by: William McCall 
> > Signed-off-by: Enke Chen 
> > ---
> >  include/linux/tcp.h   | 5 +
> >  net/ipv4/tcp.c| 1 +
> >  net/ipv4/tcp_input.c  | 3 ++-
> >  net/ipv4/tcp_output.c | 2 ++
> >  net/ipv4/tcp_timer.c  | 5 +++--
> >  5 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 2f87377e9af7..c9415b30fa67 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -352,6 +352,11 @@ struct tcp_sock {
> >
> > int linger2;
> >
> > +   /* While icsk_probes_out is for unanswered 0 window probes, this
> > +* counter is for 0-window probes that are not answered with any
> > +* non-zero window (nzw) acks.
> > +*/
> > +   u8  probes_nzw;
> >
> >  /* Sock_ops bpf program related variables */
> >  #ifdef CONFIG_BPF
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index ed42d2193c5c..af6a41a5a5ac 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -2940,6 +2940,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> > icsk->icsk_rto = TCP_TIMEOUT_INIT;
> > icsk->icsk_rto_min = TCP_RTO_MIN;
> > icsk->icsk_delack_max = TCP_DELACK_MAX;
> > +   tp->probes_nzw = 0;
> > tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
> > tp->snd_cwnd = TCP_INIT_CWND;
> > tp->snd_cwnd_cnt = 0;
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index c7e16b0ed791..4812a969c18a 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3377,13 +3377,14 @@ static void tcp_ack_probe(struct sock *sk)
> >  {
> > struct inet_connection_sock *icsk = inet_csk(sk);
> > struct sk_buff *head = tcp_send_head(sk);
> > -   const struct tcp_sock *tp = tcp_sk(sk);
> > +   struct tcp_sock *tp = tcp_sk(sk);
> >
> > /* Was it a usable window open? */
> > if (!head)
> > return;
> > if (!after(TCP_SKB_CB(head)->end_seq, tcp_wnd_end(tp))) {
> > icsk->icsk_backoff = 0;
> > +   tp->probes_nzw = 0;
> >

Re: [PATCH] tcp: keepalive fixes

2021-01-12 Thread Yuchung Cheng
On Tue, Jan 12, 2021 at 2:31 PM Enke Chen  wrote:
>
> From: Enke Chen 
>
> In this patch two issues with TCP keepalives are fixed:
>
> 1) TCP keepalive does not timeout when there are data waiting to be
>delivered and then the connection got broken. The TCP keepalive
>timeout is not evaluated in that condition.
Hi Enke,
Do you have an example to demonstrate this issue? In theory, when
there is data in flight, an RTO timer should be pending (which
considers the user-timeout setting). Based on the user-timeout
description (man tcp), the user timeout should abort the socket the
specified time after data transmission commences. Some data would
help us understand the issue.
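For reference, the abort rule the patch below proposes can be sketched as follows (illustrative only, not kernel code; the defaults correspond to the usual `net.ipv4.tcp_keepalive_*` sysctls):

```python
# Sketch of the proposed keepalive abort condition: compare wall-clock
# time since the last received ACK against a fixed budget, instead of
# counting 0-window probes in icsk_probes_out.

TCP_KEEPIDLE_S = 7200    # default net.ipv4.tcp_keepalive_time
TCP_KEEPINTVL_S = 75     # default net.ipv4.tcp_keepalive_intvl
TCP_KEEPCNT = 9          # default net.ipv4.tcp_keepalive_probes

def keepalive_should_abort(elapsed_s, user_timeout_s=0):
    """True once the connection has been silent longer than the budget."""
    if user_timeout_s:
        budget = user_timeout_s          # TCP_USER_TIMEOUT overrides
    else:
        budget = TCP_KEEPIDLE_S + TCP_KEEPINTVL_S * TCP_KEEPCNT
    return elapsed_s >= budget

# With the defaults the budget is 7200 + 75 * 9 = 7875 seconds:
assert not keepalive_should_abort(7874)
assert keepalive_should_abort(7875)
```

This matches the `elapsed >= keepalive_time_when(tp) + keepalive_intvl_when(tp) * keepalive_probes(tp)` check in the diff, with `TCP_USER_TIMEOUT` taking precedence when set.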

>
>The fix is to remove the code that prevents TCP keepalive from
>being evaluated for timeout.
>
> 2) With the fix for #1, TCP keepalive can erroneously timeout after
>the 0-window probe kicks in. The 0-window probe counter is wrongly
>applied to TCP keepalives.
>
>The fix is to use the elapsed time instead of the 0-window probe
>counter in evaluating TCP keepalive timeout.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Enke Chen 
> ---
>  net/ipv4/tcp_timer.c | 15 +++
>  1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 6c62b9ea1320..40953aa40d53 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -696,12 +696,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
> ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
> goto out;
>
> -   elapsed = keepalive_time_when(tp);
> -
> -   /* It is alive without keepalive 8) */
> -   if (tp->packets_out || !tcp_write_queue_empty(sk))
> -   goto resched;
> -
> elapsed = keepalive_time_elapsed(tp);
>
> if (elapsed >= keepalive_time_when(tp)) {
> @@ -709,16 +703,15 @@ static void tcp_keepalive_timer (struct timer_list *t)
>  * to determine when to timeout instead.
>  */
> if ((icsk->icsk_user_timeout != 0 &&
> -   elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
> -   icsk->icsk_probes_out > 0) ||
> +elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout)) ||
> (icsk->icsk_user_timeout == 0 &&
> -   icsk->icsk_probes_out >= keepalive_probes(tp))) {
> +(elapsed >= keepalive_time_when(tp) +
> + keepalive_intvl_when(tp) * keepalive_probes(tp)))) {
> tcp_send_active_reset(sk, GFP_ATOMIC);
> tcp_write_err(sk);
> goto out;
> }
> if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
> -   icsk->icsk_probes_out++;
> elapsed = keepalive_intvl_when(tp);
> } else {
> /* If keepalive was lost due to local congestion,
> @@ -732,8 +725,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
> }
>
> sk_mem_reclaim(sk);
> -
> -resched:
> inet_csk_reset_keepalive_timer (sk, elapsed);
> goto out;
>
> --
> 2.29.2
>


Re: [net-next PATCH v3] tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit

2020-12-14 Thread Yuchung Cheng
On Mon, Dec 14, 2020 at 9:42 AM Eric Dumazet  wrote:
>
> On Sat, Dec 12, 2020 at 9:31 PM Alexander Duyck
>  wrote:
> >
> > From: Alexander Duyck 
> >
> > There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> > message in the case of IPv6 or a fragmentation request in the case of
> > IPv4. This results in the socket stalling for a second or more as it does
> > not respond to the message by retransmitting the SYN frame.
> >
> > Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> > ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> > makes use of the entire MSS. In the case of fastopen it does, and an
> > additional complication is that the retransmit queue doesn't contain the
> > original frames. As a result when tcp_simple_retransmit is called and
> > walks the list of frames in the queue it may not mark the frames as lost
> > because both the SYN and the data packet each individually are smaller than
> > the MSS size after the adjustment. This results in the socket being stalled
> > until the retransmit timer kicks in and forces the SYN frame out again
> > without the data attached.
> >
> > In order to resolve this we can reduce the MSS the packets are compared
> > to in tcp_simple_retransmit to -1 for cases where we are still in the
> > TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of
> > the packets related to the fastopen SYN as lost.
> >
> > Signed-off-by: Alexander Duyck 
> > ---
> >
>
> SGTM, thanks !
>
> Signed-off-by: Eric Dumazet 
Nice work. I tested and verified it works with our packetdrill tests.

Signed-off-by: Yuchung Cheng 

>
> > v2: Changed logic to invalidate all retransmit queue frames if fastopen SYN
> > v3: Updated commit message to reflect actual solution in 3rd paragraph
> >


Re: [net-next PATCH] tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit

2020-12-12 Thread Yuchung Cheng
On Sat, Dec 12, 2020 at 11:01 AM Alexander Duyck
 wrote:
>
> On Sat, Dec 12, 2020 at 10:34 AM Yuchung Cheng  wrote:
> >
> > On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
> >  wrote:
> > >
> > > From: Alexander Duyck 
> > >
> > > There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> > > message in the case of IPv6 or a fragmentation request in the case of
> > > IPv4. This results in the socket stalling for a second or more as it does
> > > not respond to the message by retransmitting the SYN frame.
> > >
> > > Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> > > ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> > > makes use of the entire MSS. In the case of fastopen it does, and an
> > > additional complication is that the retransmit queue doesn't contain the
> > > original frames. As a result when tcp_simple_retransmit is called and
> > > walks the list of frames in the queue it may not mark the frames as lost
> > > because both the SYN and the data packet each individually are smaller 
> > > than
> > > the MSS size after the adjustment. This results in the socket being 
> > > stalled
> > > until the retransmit timer kicks in and forces the SYN frame out again
> > > without the data attached.
> > >
> > > In order to resolve this we can generate our best estimate for the 
> > > original
> > > packet size by detecting the fastopen SYN frame and then adding the
> > > overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> > > have exceeded the MSS. If so we can mark the frame as lost and retransmit
> > > it.
> > >
> > > Signed-off-by: Alexander Duyck 
> > > ---
> > >  net/ipv4/tcp_input.c |   30 +++---
> > >  1 file changed, 27 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 9e8a6c1aa019..79375b58de84 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock 
> > > *sk)
> > >  void tcp_simple_retransmit(struct sock *sk)
> > >  {
> > > const struct inet_connection_sock *icsk = inet_csk(sk);
> > > +   struct sk_buff *skb = tcp_rtx_queue_head(sk);
> > > struct tcp_sock *tp = tcp_sk(sk);
> > > -   struct sk_buff *skb;
> > > -   unsigned int mss = tcp_current_mss(sk);
> > > +   unsigned int mss;
> > > +
> > > +   /* A fastopen SYN request is stored as two separate packets within
> > > +* the retransmit queue, this is done by tcp_send_syn_data().
> > > +* As a result simply checking the MSS of the frames in the queue
> > > +* will not work for the SYN packet. So instead we must make a 
> > > best
> > > +* effort attempt by validating the data frame with the mss size
> > > +* that would be computed now by tcp_send_syn_data and comparing
> > > +* that against the data frame that would have been included with
> > > +* the SYN.
> > > +*/
> > > +   if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> > > +   struct sk_buff *syn_data = skb_rb_next(skb);
> > > +
> > > +   mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> > > + tp->tcp_header_len - sizeof(struct tcphdr) -
> > > + MAX_TCP_OPTION_SPACE;
> > Nice comment! The original syn_data MSS needs to be inferred, which
> > is a hassle to get right. My sense is that a path-MTU issue is enough
> > to warrant marking them lost.
> > I suggest simply marking the SYN & its data lost if tcp_simple_retransmit
> > is called during the TFO handshake, i.e.
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 62f7aabc7920..7f0c4f2947eb 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
> > unsigned int mss = tcp_current_mss(sk);
> >
> > skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> > -   if (tcp_skb_seglen(skb) > mss)
> > +   if (tcp_skb_seglen(skb) > mss ||
> > +   (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
> > tcp_m

Re: [net-next PATCH] tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit

2020-12-12 Thread Yuchung Cheng
On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
 wrote:
>
> From: Alexander Duyck 
>
> There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> message in the case of IPv6 or a fragmentation request in the case of
> IPv4. This results in the socket stalling for a second or more as it does
> not respond to the message by retransmitting the SYN frame.
>
> Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> makes use of the entire MSS. In the case of fastopen it does, and an
> additional complication is that the retransmit queue doesn't contain the
> original frames. As a result when tcp_simple_retransmit is called and
> walks the list of frames in the queue it may not mark the frames as lost
> because both the SYN and the data packet each individually are smaller than
> the MSS size after the adjustment. This results in the socket being stalled
> until the retransmit timer kicks in and forces the SYN frame out again
> without the data attached.
>
> In order to resolve this we can generate our best estimate for the original
> packet size by detecting the fastopen SYN frame and then adding the
> overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> have exceeded the MSS. If so we can mark the frame as lost and retransmit
> it.
>
> Signed-off-by: Alexander Duyck 
> ---
>  net/ipv4/tcp_input.c |   30 +++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9e8a6c1aa019..79375b58de84 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
>  void tcp_simple_retransmit(struct sock *sk)
>  {
> const struct inet_connection_sock *icsk = inet_csk(sk);
> +   struct sk_buff *skb = tcp_rtx_queue_head(sk);
> struct tcp_sock *tp = tcp_sk(sk);
> -   struct sk_buff *skb;
> -   unsigned int mss = tcp_current_mss(sk);
> +   unsigned int mss;
> +
> +   /* A fastopen SYN request is stored as two separate packets within
> +* the retransmit queue, this is done by tcp_send_syn_data().
> +* As a result simply checking the MSS of the frames in the queue
> +* will not work for the SYN packet. So instead we must make a best
> +* effort attempt by validating the data frame with the mss size
> +* that would be computed now by tcp_send_syn_data and comparing
> +* that against the data frame that would have been included with
> +* the SYN.
> +*/
> +   if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> +   struct sk_buff *syn_data = skb_rb_next(skb);
> +
> +   mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> + tp->tcp_header_len - sizeof(struct tcphdr) -
> + MAX_TCP_OPTION_SPACE;
Nice comment! The original syn_data MSS needs to be inferred, which is
a hassle to get right. My sense is that a path-MTU issue is enough to
warrant marking them lost.
I suggest simply marking the SYN & its data lost if tcp_simple_retransmit
is called during the TFO handshake, i.e.

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 62f7aabc7920..7f0c4f2947eb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
unsigned int mss = tcp_current_mss(sk);

skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
-   if (tcp_skb_seglen(skb) > mss)
+   if (tcp_skb_seglen(skb) > mss ||
+   (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
tcp_mark_skb_lost(sk, skb);
}

We have a TFO packetdrill test that verifies that my suggested fix
triggers an immediate retransmit instead of a 1 s wait.




>
> -   skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> +   if (syn_data && syn_data->len > mss)
> +   tcp_mark_skb_lost(sk, skb);
> +
> +   skb = syn_data;
> +   } else {
> +   mss = tcp_current_mss(sk);
> +   }
> +
> +   skb_rbtree_walk_from(skb) {
> if (tcp_skb_seglen(skb) > mss)
> tcp_mark_skb_lost(sk, skb);
> }
>
>


Re: [net PATCH] tcp: Mark fastopen SYN packet as lost when receiving ICMP_TOOBIG/ICMP_FRAG_NEEDED

2020-12-11 Thread Yuchung Cheng
On Fri, Dec 11, 2020 at 1:51 PM Alexander Duyck
 wrote:
>
> On Fri, Dec 11, 2020 at 11:18 AM Eric Dumazet  wrote:
> >
> > On Fri, Dec 11, 2020 at 6:15 PM Alexander Duyck
> >  wrote:
> > >
> > > On Fri, Dec 11, 2020 at 8:22 AM Eric Dumazet  wrote:
> > > >
> > > > On Fri, Dec 11, 2020 at 5:03 PM Alexander Duyck
> > > >  wrote:
> > > >
> > > > > That's fine. I can target this for net-next. I had just selected net
> > > > > since I had considered it a fix, but I suppose it could be considered
> > > > > a behavioral change.
> > > >
> > > > We are very late in the 5.10 cycle, and we never handled ICMP in this
> > > > state, so net-next is definitely better.
> > > >
> > > > Note that RFC 7413 states in 4.1.3 :
> > > >
> > > >  The client MUST cache cookies from servers for later Fast Open
> > > >connections.  For a multihomed client, the cookies are dependent on
> > > >the client and server IP addresses.  Hence, the client should cache
> > > >at most one (most recently received) cookie per client and server IP
> > > >address pair.
> > > >
> > > >When caching cookies, we recommend that the client also cache the
> > > >Maximum Segment Size (MSS) advertised by the server.  The client can
> > > >cache the MSS advertised by the server in order to determine the
> > > >maximum amount of data that the client can fit in the SYN packet in
> > > >subsequent TFO connections.  Caching the server MSS is useful
> > > >because, with Fast Open, a client sends data in the SYN packet before
> > > >the server announces its MSS in the SYN-ACK packet.  If the client
> > > >sends more data in the SYN packet than the server will accept, this
> > > >will likely require the client to retransmit some or all of the data.
> > > >Hence, caching the server MSS can enhance performance.
> > > >
> > > >Without a cached server MSS, the amount of data in the SYN packet is
> > > >limited to the default MSS of 536 bytes for IPv4 [RFC1122] and 1220
> > > >bytes for IPv6 [RFC2460].  Even if the client complies with this
> > > >limit when sending the SYN, it is known that an IPv4 receiver
> > > >advertising an MSS less than 536 bytes can receive a segment larger
> > > >than it is expecting.
> > > >
> > > >If the cached MSS is larger than the typical size (1460 bytes for
> > > >IPv4 or 1440 bytes for IPv6), then the excess data in the SYN packet
> > > >may cause problems that offset the performance benefit of Fast Open.
> > > >For example, the unusually large SYN may trigger IP fragmentation and
> > > >may confuse firewalls or middleboxes, causing SYN retransmission and
> > > >other side effects.  Therefore, the client MAY limit the cached MSS
> > > >to 1460 bytes for IPv4 or 1440 for IPv6.
> > > >
> > > >
> > > > Relying on ICMP is fragile, since they can be filtered in some way.
> > >
> > > In this case I am not relying on the ICMP, but thought that since I
> > > have it I should make use of it. Without the ICMP we would still just
> > > be waiting on the retransmit timer.
> > >
> > > The problem case has a v6-in-v6 tunnel between the client and the
> > > endpoint so both ends assume an MTU 1500 and advertise a 1440 MSS
> > > which works fine until they actually go to send a large packet between
> > > the two. At that point the tunnel is triggering an ICMP_TOOBIG and the
> > > endpoint is stalling since the MSS is dropped to 1400, but the SYN and
> > > data payload were already smaller than that so no retransmits are
> > > being triggered. This results in TFO being 1s slower than non-TFO
> > > because of the failure to trigger the retransmit for the frame that
> > > violated the PMTU. The patch is meant to get the two back into
> > > comparable times.
> >
> > Okay... Have you studied why tcp_v4_mtu_reduced() (and IPv6 equivalent)
> > code does not yet handle the retransmit in TCP_SYN_SENT state ?
>
> The problem lies in tcp_simple_retransmit(). Specifically the loop at
> the start of the function goes to check the retransmit queue to see if
> there are any packets larger than MSS and finds none since we don't
> place the SYN w/ data in there and instead have a separate SYN and
> data packet.
>
> I'm debating if I should take an alternative approach and modify the
> loop at the start of tcp_simple_transmit to add a check for a SYN
> packet, tp->syn_data being set, and then comparing the next frame
> length + MAX_TCP_HEADER_OPTIONS versus mss.
Thanks for bringing up this tricky issue. The root cause seems to be
the special arrangement of storing SYN-data as one (pure) SYN segment
and one non-SYN data segment. Given that tcp_simple_retransmit()
probably is not called frequently, your alternative approach sounds
more appealing to me.

Replacing that strange syn|data arrangement for TFO has been on my
wish list for a long time... Ideally it would be better to store the
SYN+data as one packet and just carve out the SYN for retransmit.
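A worked numeric example of the miss, using the figures from the v6-in-v6 tunnel scenario earlier in the thread (illustrative only; the exact payload sizing in `tcp_send_syn_data()` is an assumption here, but `MAX_TCP_OPTION_SPACE` is 40 bytes in the kernel):

```python
# Why tcp_simple_retransmit() marks nothing lost after an ICMP_TOOBIG on
# a fastopen connection: the SYN and its data sit in separate skbs, and
# each one is already at or under the *reduced* MSS.

MAX_TCP_OPTION_SPACE = 40   # kernel constant: max bytes of TCP options

advertised_mss = 1440       # cached server MSS over the 1500-MTU tunnel
new_mss = 1400              # MSS after ICMP_TOOBIG shrinks the PMTU

# tcp_send_syn_data() reserves full option space in the SYN, so the data
# segment sent alongside it carries at most:
syn_data_len = advertised_mss - MAX_TCP_OPTION_SPACE   # = 1400 bytes

syn_seglen = 0              # the pure SYN skb carries no payload

# The lost-marking test is `tcp_skb_seglen(skb) > mss`; it fails for both:
assert not (syn_seglen > new_mss)
assert not (syn_data_len > new_mss)   # 1400 > 1400 is False -> nothing lost
```

So even though the SYN+data *on the wire* exceeded the new path MTU, neither skb in the retransmit queue does, which is exactly why an unconditional "mark SYN and its data lost during the TFO handshake" check is attractive.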


Re: [PATCH 4.19 000/139] 4.19.7-stable review

2018-12-05 Thread Yuchung Cheng
On Wed, Dec 5, 2018 at 4:08 AM Rafael David Tinoco
 wrote:
>
> On 12/5/18 4:58 AM, Greg Kroah-Hartman wrote:
> > On Tue, Dec 04, 2018 at 07:09:46PM -0200, Rafael David Tinoco wrote:
> >> On 12/4/18 8:48 AM, Greg Kroah-Hartman wrote:
> >>> This is the start of the stable review cycle for the 4.19.7 release.
> >>> There are 139 patches in this series, all will be posted as a response
> >>> to this one.  If anyone has any issues with these being applied, please
> >>> let me know.
> >>>
> >>> Responses should be made by Thu Dec  6 10:36:22 UTC 2018.
> >>> Anything received after that time might be too late.
> >>>
> >>> The whole patch series can be found in one patch at:
> >>> 
> >>> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.7-rc1.gz
> >>> or in the git tree and branch at:
> >>> 
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> >>> linux-4.19.y
> >>> and the diffstat can be found below.
> >>>
> >>> thanks,
> >>>
> >>> greg k-h
> >>
> >> During functional tests for this v4.19 release, we faced a PANIC,
> >> described bellow, but unlikely related to this specific v4.19 version.
> >>
> >> First a WARN() at tcp_output.c:
> >>
> >> tcp_send_loss_probe():
> >> ...
> >>  /* Retransmit last segment. */
> >>  if (WARN_ON(!skb))
> >>  goto rearm_timer;
> >> ...
> >>
> >> [  173.557528] WARNING: CPU: 1 PID: 0 at
> >> /srv/oe/build/tmp-rpb-glibc/work-shared/juno/kernel-source/net/ipv4/tcp_output.c:2485
> >> tcp_send_loss_probe+0x164/0x1e8
> >> [  173.571425] Modules linked in: crc32_ce crct10dif_ce fuse
> >> [  173.576804] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.7-rc1 #1
> >> [  173.583014] Hardware name: ARM Juno development board (r2) (DT)
> >
> > So only this one machine saw this failure?
> >
> > If you can reproduce it again, bisection would be great to do if
> > possible.
>
> Yes, the only machine... I'm afraid this issue is intermittent and
> depends on TCP Tail Loss and a specific race causing the NULL
> dereference, so bisection would be tricky since it has happened
> independently of the functional test that was running. I have also
> copied authors for the Tail Loss code to check if they got any clues
> even without KASAN data.
The cause is an inconsistency in packet accounting: the TCP
retransmission queue is empty but tp->packets_out is not zero. We will
send a fix soon. Thanks.

>
> Thank you,
> -
> Rafael D. Tinoco
>


Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-28 Thread Yuchung Cheng
On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko
 wrote:
> Hi.
>
> Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it
> intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix
> warning in tcp_fastretrans_alert() for me.

Hi Oleksandr: no, retrans_collapse should not matter for that warning
in tcp_fastretrans_alert(). The warning, as I explained earlier, is
likely a false positive. Neal and I are more concerned about the panic
in tcp_sacktag_walk(). This was just a blind shot, but thanks for
retrying.

We can submit a one-liner to remove the fast-retransmit warning, but
we want to nail down the bigger issue first.

>
> On středa 27. září 2017 2:18:32 CEST Yuchung Cheng wrote:
>> On Tue, Sep 26, 2017 at 5:12 PM, Yuchung Cheng  wrote:
>> > On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin  wrote:
>> >>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin  wrote:
>> >>> > > Hello.
>> >>> > >
>> >>> > > Since, IIRC, v4.11, there is some regression in TCP stack resulting
>> >>> > > in the
>> >>> > > warning shown below. Most of the time it is harmless, but rarely it
>> >>> > > just
>> >>> > > causes either freeze or (I believe, this is related too) panic in
>> >>> > > tcp_sacktag_walk() (because sk_buff passed to this function is
>> >>> > > NULL).
>> >>> > > Unfortunately, I still do not have proper stacktrace from panic, but
>> >>> > > will try to capture it if possible.
>> >>> > >
>> >>> > > Also, I have custom settings regarding TCP stack, shown below as
>> >>> > > well. ifb is used to shape traffic with tc.
>> >>> > >
>> >>> > > Please note this regression was already reported as BZ [1] and as a
>> >>> > > letter to ML [2], but got neither attention nor resolution. It is
>> >>> > > reproducible for (not only) me on my home router since v4.11 till
>> >>> > > v4.13.1 incl.
>> >>> > >
>> >>> > > Please advise on how to deal with it. I'll provide any additional
>> >>> > > info if
>> >>> > > necessary, also ready to test patches if any.
>> >>> > >
>> >>> > > Thanks.
>> >>> > >
>> >>> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
>> >>> > > [2]
>> >>> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.ne
>> >>> > > t_lists_netdev_msg436158.html&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=jJ
>> >>> > > YgtDM7QT-W-Fz_d29HYQ&m=MDDRfLG5DvdOeniMpaZDJI8ulKQ6PQ6OX_1YtRsiTMA&s
>> >>> > > =-n3dGZw-pQ95kMBUfq5G9nYZFcuWtbTDlYFkcvQPoKc&e=>>> >
>> >>> > We're experiencing the same problems on some machines in our fleet.
>> >>> > Exactly the same symptoms: tcp_fastretrans_alert() warnings and
>> >>> > sometimes panics in tcp_sacktag_walk().
>> >>
>> >>> > Here is an example of a backtrace with the panic log:
>> >> Hi Yuchung!
>> >>
>> >>> do you still see the panics if you disable RACK?
>> >>> sysctl net.ipv4.tcp_recovery=0?
>> >>
>> >> No, we haven't seen any crash since that.
>> >
>> > I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
>> > take an empty skb :-( Do you have a stack trace or any hint on which
>> > call to tcp_sacktag_walk triggered the panic? Internally at Google we
>> > never see that.
>>
>> hmm something just struck me: could you try
>> sysctl net.ipv4.tcp_recovery=1 net.ipv4.tcp_retrans_collapse=0
>> and see if kernel still panics on sack processing?
>>
>> >>> also have you experience any sack reneg? could you post the output of
>> >>> ' nstat |grep -i TCP' thanks
>> >>
>> >> hostname  TcpActiveOpens           2289680      0.0
>> >> hostname  TcpPassiveOpens          3592758      0.0
>> >> hostname  TcpAttemptFails          746910       0.0
>> >> hostname  TcpEstabResets           154988       0.0
>> >> hostname  TcpInSegs                16258678255  0.0
>> >> hostname  TcpOutSegs               46967011611  0.0

Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-26 Thread Yuchung Cheng
On Tue, Sep 26, 2017 at 5:12 PM, Yuchung Cheng  wrote:
> On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin  wrote:
>>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin  wrote:
>>> >
>>> > > Hello.
>>> > >
>>> > > Since, IIRC, v4.11, there is some regression in TCP stack resulting in 
>>> > > the
>>> > > warning shown below. Most of the time it is harmless, but rarely it just
>>> > > causes either freeze or (I believe, this is related too) panic in
>>> > > tcp_sacktag_walk() (because sk_buff passed to this function is NULL).
>>> > > Unfortunately, I still do not have proper stacktrace from panic, but 
>>> > > will try
>>> > > to capture it if possible.
>>> > >
>>> > > Also, I have custom settings regarding TCP stack, shown below as well. 
>>> > > ifb is
>>> > > used to shape traffic with tc.
>>> > >
>>> > > Please note this regression was already reported as BZ [1] and as a 
>>> > > letter to
>>> > > ML [2], but got neither attention nor resolution. It is reproducible 
>>> > > for (not
>>> > > only) me on my home router since v4.11 till v4.13.1 incl.
>>> > >
>>> > > Please advise on how to deal with it. I'll provide any additional info 
>>> > > if
>>> > > necessary, also ready to test patches if any.
>>> > >
>>> > > Thanks.
>>> > >
>>> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
>>> > > [2] https://www.spinics.net/lists/netdev/msg436158.html
>>> >
>>> > We're experiencing the same problems on some machines in our fleet.
>>> > Exactly the same symptoms: tcp_fastretrans_alert() warnings and
>>> > sometimes panics in tcp_sacktag_walk().
>>> >
>>> > Here is an example of a backtrace with the panic log:
>>
>> Hi Yuchung!
>>
>>> do you still see the panics if you disable RACK?
>>> sysctl net.ipv4.tcp_recovery=0?
>>
>> No, we haven't seen any crash since that.
> I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
> take an empty skb :-( Do you have a stack trace or any hint on which
> call to tcp_sacktag_walk triggered the panic? Internally at Google we
> never see that.
Hmm, something just struck me: could you try
sysctl net.ipv4.tcp_recovery=1 net.ipv4.tcp_retrans_collapse=0
and see if kernel still panics on sack processing?

>
>
>>
>>>
>>> also have you experience any sack reneg? could you post the output of
>>> ' nstat |grep -i TCP' thanks
>>
>> hostname  TcpActiveOpens             2289680      0.0
>> hostname  TcpPassiveOpens            3592758      0.0
>> hostname  TcpAttemptFails            746910       0.0
>> hostname  TcpEstabResets             154988       0.0
>> hostname  TcpInSegs                  16258678255  0.0
>> hostname  TcpOutSegs                 46967011611  0.0
>> hostname  TcpRetransSegs             13724310     0.0
>> hostname  TcpInErrs                  2            0.0
>> hostname  TcpOutRsts                 9418798      0.0
>> hostname  TcpExtEmbryonicRsts        2303         0.0
>> hostname  TcpExtPruneCalled          90192        0.0
>> hostname  TcpExtOfoPruned            57274        0.0
>> hostname  TcpExtOutOfWindowIcmps     3            0.0
>> hostname  TcpExtTW                   1164705      0.0
>> hostname  TcpExtTWRecycled           2            0.0
>> hostname  TcpExtPAWSEstab            159          0.0
>> hostname  TcpExtDelayedACKs          209207209    0.0
>> hostname  TcpExtDelayedACKLocked     508571       0.0
>> hostname  TcpExtDelayedACKLost       1713248      0.0
>> hostname  TcpExtListenOverflows      625          0.0
>> hostname  TcpExtListenDrops          625          0.0
>> hostname  TcpExtTCPHPHits            9341188489   0.0

Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-26 Thread Yuchung Cheng
On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin  wrote:
>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin  wrote:
>> >
>> > > Hello.
>> > >
>> > > Since, IIRC, v4.11, there is some regression in TCP stack resulting in 
>> > > the
>> > > warning shown below. Most of the time it is harmless, but rarely it just
>> > > causes either freeze or (I believe, this is related too) panic in
>> > > tcp_sacktag_walk() (because sk_buff passed to this function is NULL).
>> > > Unfortunately, I still do not have proper stacktrace from panic, but 
>> > > will try
>> > > to capture it if possible.
>> > >
>> > > Also, I have custom settings regarding TCP stack, shown below as well. 
>> > > ifb is
>> > > used to shape traffic with tc.
>> > >
>> > > Please note this regression was already reported as BZ [1] and as a 
>> > > letter to
>> > > ML [2], but got neither attention nor resolution. It is reproducible for 
>> > > (not
>> > > only) me on my home router since v4.11 till v4.13.1 incl.
>> > >
>> > > Please advise on how to deal with it. I'll provide any additional info if
>> > > necessary, also ready to test patches if any.
>> > >
>> > > Thanks.
>> > >
>> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
>> > > [2] https://www.spinics.net/lists/netdev/msg436158.html
>> >
>> > We're experiencing the same problems on some machines in our fleet.
>> > Exactly the same symptoms: tcp_fastretrans_alert() warnings and
>> > sometimes panics in tcp_sacktag_walk().
>> >
>> > Here is an example of a backtrace with the panic log:
>
> Hi Yuchung!
>
>> do you still see the panics if you disable RACK?
>> sysctl net.ipv4.tcp_recovery=0?
>
> No, we haven't seen any crash since that.
I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
take an empty skb :-( Do you have a stack trace or any hint on which
call to tcp_sacktag_walk triggered the panic? Internally at Google we
never see that.


>
>>
>> also have you experience any sack reneg? could you post the output of
>> ' nstat |grep -i TCP' thanks
>
> hostname  TcpActiveOpens              2289680      0.0
> hostname  TcpPassiveOpens             3592758      0.0
> hostname  TcpAttemptFails             746910       0.0
> hostname  TcpEstabResets              154988       0.0
> hostname  TcpInSegs                   16258678255  0.0
> hostname  TcpOutSegs                  46967011611  0.0
> hostname  TcpRetransSegs              13724310     0.0
> hostname  TcpInErrs                   2            0.0
> hostname  TcpOutRsts                  9418798      0.0
> hostname  TcpExtEmbryonicRsts         2303         0.0
> hostname  TcpExtPruneCalled           90192        0.0
> hostname  TcpExtOfoPruned             57274        0.0
> hostname  TcpExtOutOfWindowIcmps      3            0.0
> hostname  TcpExtTW                    1164705      0.0
> hostname  TcpExtTWRecycled            2            0.0
> hostname  TcpExtPAWSEstab             159          0.0
> hostname  TcpExtDelayedACKs           209207209    0.0
> hostname  TcpExtDelayedACKLocked      508571       0.0
> hostname  TcpExtDelayedACKLost        1713248      0.0
> hostname  TcpExtListenOverflows       625          0.0
> hostname  TcpExtListenDrops           625          0.0
> hostname  TcpExtTCPHPHits             9341188489   0.0
> hostname  TcpExtTCPPureAcks           1434646465   0.0
> hostname  TcpExtTCPHPAcks             5733614672   0.0
> hostname  TcpExtTCPSackRecovery       3261698      0.0
> hostname  TcpExtTCPSACKReneging       12203        0.0
> hostname  TcpExtTCPSACKReorder        433189       0.0
> hostname  TcpExtTCPTSReorder          22694        0.0
> hostname  TcpExtTCPFullUndo           45092        0.0
> hostname  TcpExtTCPPartialUndo        22016        0.0
> hostname  TcpExtTCPLossUndo           2150040      0.0
> hostname  TcpExtTCPLostRetransmit     60119        0.0
> hostname  TcpExtTCPSackFailures       2626782      0.0
> hostname  TcpExtTCPLossFailures       182999       0.0
> hostname  TcpExtTCPFastRetrans        4334275      0.0
> hostname  TcpExtTCPSlowStartRetrans   3453348      0.0
> hostname  TcpExtTCPTimeouts           1070997      0.0
> hostname  TcpExtTCPLossProbes         2633545

Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-21 Thread Yuchung Cheng
On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin  wrote:
>
> > Hello.
> >
> > Since, IIRC, v4.11, there is some regression in TCP stack resulting in the
> > warning shown below. Most of the time it is harmless, but rarely it just
> > causes either freeze or (I believe, this is related too) panic in
> > tcp_sacktag_walk() (because sk_buff passed to this function is NULL).
> > Unfortunately, I still do not have proper stacktrace from panic, but will 
> > try
> > to capture it if possible.
> >
> > Also, I have custom settings regarding TCP stack, shown below as well. ifb 
> > is
> > used to shape traffic with tc.
> >
> > Please note this regression was already reported as BZ [1] and as a letter 
> > to
> > ML [2], but got neither attention nor resolution. It is reproducible for 
> > (not
> > only) me on my home router since v4.11 till v4.13.1 incl.
> >
> > Please advise on how to deal with it. I'll provide any additional info if
> > necessary, also ready to test patches if any.
> >
> > Thanks.
> >
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
> > [2] https://www.spinics.net/lists/netdev/msg436158.html
>
> We're experiencing the same problems on some machines in our fleet.
> Exactly the same symptoms: tcp_fastretrans_alert() warnings and
> sometimes panics in tcp_sacktag_walk().
>
> Here is an example of a backtrace with the panic log:
do you still see the panics if you disable RACK?
sysctl net.ipv4.tcp_recovery=0?

also have you experience any sack reneg? could you post the output of
' nstat |grep -i TCP' thanks


>
> [973978.210080]  fuse
> [973978.214099]  sg
> [973978.217789]  loop
> [973978.221829]  efivarfs
> [973978.226544]  autofs4
> [973978.231109] CPU: 12 PID: 3806320 Comm: ld:srv:W20 Tainted: GW 
>   4.11.3-7_fbk1_1174_ga56eebf #7
> [973978.250563] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM06 
>   10/28/2016
> [973978.266558] Call Trace:
> [973978.271615]  
> [973978.275817]  dump_stack+0x4d/0x70
> [973978.282626]  __warn+0xd3/0xf0
> [973978.288727]  warn_slowpath_null+0x1e/0x20
> [973978.296910]  tcp_fastretrans_alert+0xacf/0xbd0
> [973978.305974]  tcp_ack+0xbce/0x1390
> [973978.312770]  tcp_rcv_established+0x1ce/0x740
> [973978.321488]  tcp_v6_do_rcv+0x195/0x440
> [973978.329166]  tcp_v6_rcv+0x94c/0x9f0
> [973978.336323]  ip6_input_finish+0xea/0x430
> [973978.344330]  ip6_input+0x32/0xa0
> [973978.350968]  ? ip6_rcv_finish+0xa0/0xa0
> [973978.358799]  ip6_rcv_finish+0x4b/0xa0
> [973978.366289]  ipv6_rcv+0x2ec/0x4f0
> [973978.373082]  ? ip6_make_skb+0x1c0/0x1c0
> [973978.380919]  __netif_receive_skb_core+0x2d5/0x9a0
> [973978.390505]  __netif_receive_skb+0x16/0x70
> [973978.398875]  netif_receive_skb_internal+0x23/0x80
> [973978.408462]  napi_gro_receive+0x113/0x1d0
> [973978.416657]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [973978.426398]  mlx5e_poll_rx_cq+0x7c/0x7f0
> [973978.434405]  mlx5e_napi_poll+0x8c/0x380
> [973978.442238]  ? mlx5_cq_completion+0x54/0xb0
> [973978.450770]  net_rx_action+0x22e/0x380
> [973978.458447]  __do_softirq+0x106/0x2e8
> [973978.465950]  irq_exit+0xb0/0xc0
> [973978.472396]  do_IRQ+0x4f/0xd0
> [973978.478495]  common_interrupt+0x86/0x86
> [973978.486329] RIP: 0033:0x7f8dee58d8ae
> [973978.493642] RSP: 002b:7f8cb925f078 EFLAGS: 0206
> [973978.504251]  ORIG_RAX: ff5f
> [973978.512085] RAX: 7f8cb925f1a8 RBX: 4800 RCX: 
> 7f8764bd6a80
> [973978.526508] RDX: 01ba RSI: 7f7cb4ca3410 RDI: 
> 7f7cb4ca3410
> [973978.540927] RBP: 000d R08: 7f8764bd6b00 R09: 
> 7f8764bd6b80
> [973978.555347] R10: 2400 R11: 7f8dee58e240 R12: 
> d3273c84146e8c29
> [973978.569766] R13: 9dac83ddf04c235c R14: 7f7cb4ca33b0 R15: 
> 7f7cb4ca4f50
> [973978.584189]  
> [973978.588650] ---[ end trace 5d1c83e12a57d039 ]---
> [973995.178183] BUG: unable to handle kernel
> [973995.186385] NULL pointer dereference
> [973995.193692]  at 0028
> [973995.200323] IP: tcp_sacktag_walk+0x2b7/0x460
> [973995.209032] PGD 102d856067
> [973995.214789] PUD fded0d067
> [973995.220385] PMD 0
> [973995.224577]
> [973995.227732] [ cut here ]
> [973995.237128] Oops:  [#1] SMP
> [973995.243575] Modules linked in:
> [973995.249868]  mptctl
> [973995.254251]  mptbase
> [973995.258792]  xt_DSCP
> [973995.263331]  xt_set
> [973995.267698]  ip_set_hash_ip
> [973995.273452]  cls_u32
> [973995.277993]  sch_sfq
> [973995.282535]  cls_fw
> [973995.286904]  sch_htb
> [973995.291444]  mpt3sas
> [973995.295982]  raid_class
> [973995.301044]  ip6table_mangle
> [973995.306973]  iptable_mangle
> [973995.312726]  cls_bpf
> [973995.317268]  tcp_diag
> [973995.321983]  udp_diag
> [973995.326697]  inet_diag
> [973995.331585]  ip6table_filter
> [973995.337513]  xt_NFLOG
> [973995.342226]  nfnetlink_log
> [973995.347807]  xt_comment
> [973995.352866]  xt_statistic
> [973995.358276]  iptable_filter
> [973995.364029]  xt_mark
> [973995.368572]  sb_edac
> [973995.373113]  ed

Re: TCP reaching to maximum throughput after a long time

2016-04-12 Thread Yuchung Cheng
On Tue, Apr 12, 2016 at 2:40 PM, Ben Greear  wrote:
> On 04/12/2016 01:29 PM, Eric Dumazet wrote:
>>
>> On Tue, 2016-04-12 at 13:23 -0700, Ben Greear wrote:
>>
>>> It worked well enough for years that I didn't even know other algorithms
>>> were
>>> available.  It was broken around 4.0 time, and I reported it to the list,
>>> and no one seemed to really care enough to do anything about it.  I
>>> changed
>>> to reno and ignored the problem as well.
>>>
>>> It is trivially easy to see the regression when using ath10k NIC, and
>>> from this email
>>> thread, I guess other NICs have similar issues.
>>
>>
>> Since it is so trivial, why don't you start a bisection ?
>
>
> I vaguely remember doing a bisect, but I can't find any email about
> that, so maybe I didn't.  At any rate, it is somewhere between 3.17 and 4.0.
> From memory, it was between 3.19 and 4.0, but I am not certain of that.
>
> Neil's suggestion, from the thread below, is that it was likely:  "605ad7f
> tcp: refine TSO autosizing"
>
> Here is previous email thread:
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg80803.html
>
> This one has a link to a pcap I made at the time:
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg80890.html
Based on the previous thread, I propose we disable HyStart's ACK-train
detection. It is brittle under various circumstances; we've disabled it
at Google for years.
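
For concreteness, here is a toy model (my own sketch, not the kernel code) of
the ACK-train half of HyStart, showing why ACK compression or aggregation can
fool it into exiting slow start far too early:

```python
def ack_train_exceeded(ack_times_ms, min_rtt_ms, spacing_thresh_ms=2.0):
    """Toy model of CUBIC HyStart's ACK-train heuristic (not kernel code):
    consecutive ACKs spaced closer than a threshold extend a 'train'; once
    the train spans more than min_rtt/2, the heuristic infers the pipe is
    full and exits slow start. Aggregated or compressed ACKs (common on
    Wi-Fi and middleboxes) easily fake a long train."""
    train_start = ack_times_ms[0]
    last = ack_times_ms[0]
    for t in ack_times_ms[1:]:
        if t - last > spacing_thresh_ms:
            train_start = t          # gap too large: restart the train
        last = t
        if last - train_start >= min_rtt_ms / 2:
            return True              # train long enough -> exit slow start
    return False
```

On stock kernels the ack-train component can usually be turned off at runtime
via the tcp_cubic module parameter, e.g.
`echo 2 > /sys/module/tcp_cubic/parameters/hystart_detect` (keeping only the
delay heuristic), though the parameter's availability depends on the build.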

>
>>
>> I asked a capture, I did not say ' switch to Reno or whatever ', right ?
>>
>> Guessing is nice, but investigating and fixing is better.
>>
>> Do not assume that nothing can be done, please ?
>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>


Re: TCP reaching to maximum throughput after a long time

2016-04-12 Thread Yuchung Cheng
On Tue, Apr 12, 2016 at 7:52 AM, Eric Dumazet  wrote:
>
> On Tue, 2016-04-12 at 12:17 +, Machani, Yaniv wrote:
> > Hi,
> > After updating from Kernel 3.14 to Kernel 4.4 we have seen a TCP 
> > performance degradation over Wi-Fi.
> > In the 3.14 kernel, TCP got to its max throughput in less than a second,
> > while in 4.4 it takes ~20-30 seconds.
> > UDP TX/RX and TCP RX performance is as expected.
> > We are using a Beagle Bone Black and a WiLink8 device.
> >
> > Were there any related changes that might cause such behavior ?
> > Kernel configuration and sysctl values were compared, but no significant 
> > differences have been found.
> >
> > See a log of the behavior below :
> > ---
> > Client connecting to 10.2.46.5, TCP port 5001
> > TCP window size:  320 KByte (WARNING: requested  256 KByte)
> > 
> > [  3] local 10.2.46.6 port 49282 connected with 10.2.46.5 port 5001
> > [ ID] Interval   Transfer Bandwidth
> > [  3]  0.0- 1.0 sec  5.75 MBytes  48.2 Mbits/sec
> > [  3]  1.0- 2.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  2.0- 3.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  3.0- 4.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  4.0- 5.0 sec  6.75 MBytes  56.6 Mbits/sec
> > [  3]  5.0- 6.0 sec  3.38 MBytes  28.3 Mbits/sec
> > [  3]  6.0- 7.0 sec  6.38 MBytes  53.5 Mbits/sec
> > [  3]  7.0- 8.0 sec  6.88 MBytes  57.7 Mbits/sec
> > [  3]  8.0- 9.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3]  9.0-10.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 10.0-11.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 11.0-12.0 sec  7.25 MBytes  60.8 Mbits/sec
> > [  3] 12.0-13.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 13.0-14.0 sec  7.25 MBytes  60.8 Mbits/sec
> > [  3] 14.0-15.0 sec  7.62 MBytes  64.0 Mbits/sec
> > [  3] 15.0-16.0 sec  7.88 MBytes  66.1 Mbits/sec
> > [  3] 16.0-17.0 sec  8.12 MBytes  68.2 Mbits/sec
> > [  3] 17.0-18.0 sec  8.25 MBytes  69.2 Mbits/sec
> > [  3] 18.0-19.0 sec  8.50 MBytes  71.3 Mbits/sec
> > [  3] 19.0-20.0 sec  8.88 MBytes  74.4 Mbits/sec
> > [  3] 20.0-21.0 sec  8.75 MBytes  73.4 Mbits/sec
> > [  3] 21.0-22.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 22.0-23.0 sec  8.75 MBytes  73.4 Mbits/sec
> > [  3] 23.0-24.0 sec  8.50 MBytes  71.3 Mbits/sec
> > [  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 25.0-26.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 26.0-27.0 sec  8.62 MBytes  72.4 Mbits/sec
> >
>
> CC netdev, where this is better discussed.
>
> This could be a lot of different factors, and caused by a sender
> problem, a receiver problem, ...
>
> TCP behavior depends on the drivers, so maybe a change there can explain
> this.
>
> You could capture the first 5000 frames of the flow and post the pcap ?
> (-s 128 to capture only the headers)
A pcap would be really helpful indeed. If possible, please capture on
both the 4.4 and 3.14 kernels.

>
> tcpdump -p -s 128 -i eth0 -c 5000 host 10.2.46.5 -w flow.pcap
>
>
> Also, while test is running, you could fetch
> ss -temoi dst 10.2.46.5:5001
>
>
>
>
>


Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2016-01-06 Thread Yuchung Cheng
On Wed, Jan 6, 2016 at 10:19 AM, Yuchung Cheng  wrote:
> On Wed, Jan 6, 2016 at 8:50 AM, Oleksandr Natalenko
>  wrote:
>>
>> Unfortunately, the patch didn't help -- I've got the same stacktrace with 
>> slightly different offset (+3) within the function.
>>
>> Now trying to get full stacktrace via netconsole. Need more time.
>>
>> Meanwhile, any other ideas on what went wrong?
>
> That's odd because the patch already checks for and avoids the div0. Can
> you post me the stacktrace and kernel warnings, if any ...
>
> One possibility is that tcp_cwnd_reduction() may set a cwnd of 0,
> which then gets used to start another recovery phase. This may or may
> not be the culprit of this div0 issue because I wasn't able to
> reproduce exactly your issue on our servers. But I will post the fix
> today and CC you.
Could you turn off ECN (sysctl net.ipv4.tcp_ecn=0) to see if this still happens?

>
>>
>>
>> On December 22, 2015 4:10:32 AM EET, Yuchung Cheng  wrote:
>> >On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
>> > wrote:
>> >> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB
>> >mode by
>> >> default and SS mode conditionally) introduced changes to
>> >net/ipv4/tcp_input.c
>> >> tcp_cwnd_reduction() that, possibly, cause division by zero, and
>> >therefore,
>> >> kernel panic in interrupt handler [1].
>> >>
>> >> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the
>> >issue.
>> >>
>> >> I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
>> >> (occasionally).
>> >>
>> >> What could be done to help in debugging this issue?
>> >Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
>> >
>> >If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
>> >state w/o calling tcp_init_cwnd_reduct() to set tp->prior_cwnd. Can
>> >you try this debug / quick-fix patch and send me the error message if
>> >any?
>> >
>> >
>> >>
>> >> Regards,
>> >>   Oleksandr.
>> >>
>> >> [1] http://i.piccy.info/
>> >>
>> >i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2016-01-06 Thread Yuchung Cheng
On Wed, Jan 6, 2016 at 8:50 AM, Oleksandr Natalenko
 wrote:
>
> Unfortunately, the patch didn't help -- I've got the same stacktrace with 
> slightly different offset (+3) within the function.
>
> Now trying to get full stacktrace via netconsole. Need more time.
>
> Meanwhile, any other ideas on what went wrong?

That's odd because the patch already checks for and avoids the div0. Can
you post me the stacktrace and kernel warnings, if any ...

One possibility is that tcp_cwnd_reduction() may set a cwnd of 0,
which then gets used to start another recovery phase. This may or may
not be the culprit of this div0 issue because I wasn't able to
reproduce exactly your issue on our servers. But I will post the fix
today and CC you.
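
For readers following along, a rough model of the division in question (my own
sketch of PRR's proportional step from RFC 6937, not the kernel code) shows why
an unset tp->prior_cwnd is fatal:

```python
import math

def prr_sndcnt(prr_delivered, prr_out, ssthresh, prior_cwnd):
    """Rough model of PRR's proportional step (RFC 6937), not kernel code.
    tcp_cwnd_reduction() performs an integer division by prior_cwnd; if
    recovery starts without tcp_init_cwnd_reduction() having set
    prior_cwnd, the kernel divides by zero."""
    if prior_cwnd == 0:
        # The kernel has no such guard on this path -- hence the panic.
        raise ZeroDivisionError("prior_cwnd never initialized")
    target = math.ceil(prr_delivered * ssthresh / prior_cwnd)
    return max(0, target - prr_out)
```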

>
>
> On December 22, 2015 4:10:32 AM EET, Yuchung Cheng  wrote:
> >On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
> > wrote:
> >> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB
> >mode by
> >> default and SS mode conditionally) introduced changes to
> >net/ipv4/tcp_input.c
> >> tcp_cwnd_reduction() that, possibly, cause division by zero, and
> >therefore,
> >> kernel panic in interrupt handler [1].
> >>
> >> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the
> >issue.
> >>
> >> I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
> >> (occasionally).
> >>
> >> What could be done to help in debugging this issue?
> >Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?
> >
> >If so I suspect an ACK carrying ECE during CA_Loss causes entering CWR
> >state w/o calling tcp_init_cwnd_reduct() to set tp->prior_cwnd. Can
> >you try this debug / quick-fix patch and send me the error message if
> >any?
> >
> >
> >>
> >> Regards,
> >>   Oleksandr.
> >>
> >> [1] http://i.piccy.info/
> >>
> >i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [REGRESSION] tcp/ipv4: kernel panic because of (possible) division by zero

2015-12-21 Thread Yuchung Cheng
On Mon, Dec 21, 2015 at 12:25 PM, Oleksandr Natalenko
 wrote:
> Commit 3759824da87b30ce7a35b4873b62b0ba38905ef5 (tcp: PRR uses CRB mode by
> default and SS mode conditionally) introduced changes to net/ipv4/tcp_input.c
> tcp_cwnd_reduction() that, possibly, cause division by zero, and therefore,
> kernel panic in interrupt handler [1].
>
> Reverting 3759824da87b30ce7a35b4873b62b0ba38905ef5 seems to fix the issue.
>
> I'm able to reproduce the issue on 4.3.0–4.3.3 once per several day
> (occasionally).
>
> What could be done to help in debugging this issue?
Do you have ECN enabled (i.e. sysctl net.ipv4.tcp_ecn > 0)?

If so, I suspect an ACK carrying ECE during CA_Loss causes entry into the
CWR state without calling tcp_init_cwnd_reduction() to set tp->prior_cwnd.
Can you try this debug / quick-fix patch and send me the error message, if
any?


>
> Regards,
>   Oleksandr.
>
> [1] http://i.piccy.info/
> i9/6f5cb187c4ff282d189f78c63f95af43/1450729403/283985/951663/panic.jpg


0001-tcp-debug-tcp_cwnd_reduction-div0.patch
Description: Binary data


Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2015-10-26 Thread Yuchung Cheng
On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund  wrote:
>
>
> > On 26 Oct 2015, at 15:50, Neal Cardwell  wrote:
> >
> > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad
> >  wrote:
> >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, int 
> >> level,
> > ...
> >> +   case TCP_RDB:
> >> +   if (val < 0 || val > 1) {
> >> +   err = -EINVAL;
> >> +   } else {
> >> +   tp->rdb = val;
> >> +   tp->nonagle = val;
> >
> > The semantics of the tp->nonagle bits are already a bit complex. My
> > sense is that having a setsockopt of TCP_RDB transparently modify the
> > nagle behavior is going to add more extra complexity and unanticipated
> > behavior than is warranted given the slight possible gain in
> > convenience to the app writer. What about a model where the
> > application user just needs to remember to call
> > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > sensible? I see your nice tests at
> >
> >   
> > https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b
> >
> > are already doing that. And my sense is that likewise most
> > well-engineered "thin stream" apps will already be using
> > setsockopt(TCP_NODELAY). Is that workable?
>
> We have been discussing this a bit back and forth. Your suggestion would be 
> the right thing to keep the nagle semantics less complex and to educate 
> developers in the intrinsics of the transport.
>
> We ended up choosing to implicitly disable nagle since it
> 1) is incompatible with the logic of RDB.
> 2) leaving it up to the developer to read the documentation and notice the
> line saying that "failing to set TCP_NODELAY will void the RDB latency gain"
> would increase the chance of misconfigurations leading to deployments with
> no effect.
>
> The hope was to help both the well-engineered thin-stream apps and the ones 
> deployed by developers with less detailed knowledge of the transport.
But would RDB be voided if this developer turns on RDB and then turns
Nagle back on later?

>
> -Andreas
>


Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)

2015-10-23 Thread Yuchung Cheng
On Fri, Oct 23, 2015 at 1:50 PM, Bendik Rønning Opstad
 wrote:
>
> This is a request for comments.
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
> Latency-sensitive applications or services, such as online games and
> remote desktop, produce traffic with thin-stream characteristics,
> characterized by small packets and a relatively high ITT. By bundling
> already sent data in packets with new data, RDB alleviates head-of-line
> blocking by reducing the need to retransmit data segments when packets
> are lost. RDB is a continuation on the work on latency improvements for
> TCP in Linux, previously resulting in two thin-stream mechanisms in the
> Linux kernel
> (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).
>
> The RDB implementation has been thoroughly tested, and shows
> significant latency reductions when packet loss occurs[1]. The tests
> show that, by imposing restrictions on the bundling rate, it can be made
> not to negatively affect competing traffic in an unfair manner.
>
> Note: Current patch set depends on a recently submitted patch for
> tcp_skb_cb (tcp: refactor struct tcp_skb_cb: 
> http://patchwork.ozlabs.org/patch/510674)
>
> These patches have been tested with as set of packetdrill scripts located at
> https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
> (The tests require patching packetdrill with a new socket option:
> https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)
>
> Detailed info about the RDB mechanism can be found at
> http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the 
> paper

What's the difference between RDB and TCP repacketization
(http://flylib.com/books/en/3.223.1.226/1/)?

Reading the blog page, I am concerned about the amount of change
(especially on the fast path) just to bundle new writes during timeout &
retransmit, for a specific type of application. Why not just send X
packets with total bytes < MSS on timeout?
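
To make the comparison concrete, the core RDB idea can be sketched in a few
lines (my own illustration based on the blog description, not the patch
itself):

```python
def rdb_bundle(unacked: bytes, new_data: bytes, mss: int) -> bytes:
    """Sketch of Redundant Data Bundling: prepend as much already-sent but
    still-unacked data as fits in front of the new bytes, so the loss of a
    previous packet can be repaired by the next regular transmission
    instead of waiting for a retransmission timeout."""
    room = mss - len(new_data)
    if room <= 0:
        return new_data[:mss]        # no room to bundle; plain segment
    return unacked[-room:] + new_data
```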

> "Latency and Fairness Trade-Off for Thin Streams using Redundant Data
> Bundling in TCP"[2].
>
> [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
> [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf
>
>
> Bendik Rønning Opstad (2):
>   tcp: Add DPIFL thin stream detection mechanism
>   tcp: Add Redundant Data Bundling (RDB)
>
>  Documentation/networking/ip-sysctl.txt |  23 +++
>  include/linux/skbuff.h |   1 +
>  include/linux/tcp.h|   9 +-
>  include/net/tcp.h  |  34 
>  include/uapi/linux/tcp.h   |   1 +
>  net/core/skbuff.c  |   3 +-
>  net/ipv4/Makefile  |   3 +-
>  net/ipv4/sysctl_net_ipv4.c |  35 
>  net/ipv4/tcp.c |  19 ++-
>  net/ipv4/tcp_input.c   |   3 +
>  net/ipv4/tcp_output.c  |  11 +-
>  net/ipv4/tcp_rdb.c | 281 
> +
>  12 files changed, 415 insertions(+), 8 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> --
> 1.9.1
>


Re: [PATCH] tcp: Fix RFC reference in comment

2015-01-13 Thread Yuchung Cheng
On Tue, Jan 13, 2015 at 1:10 PM, Debabrata Banerjee  wrote:
>
> Comment in tcp_cwnd_restart() was referencing the wrong RFC for the algorithm
> it's implementing.
>
> Signed-off-by: Debabrata Banerjee 
> ---
>  net/ipv4/tcp_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b..0c13f88 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -139,7 +139,7 @@ static __u16 tcp_advertise_mss(struct sock *sk)
> return (__u16)mss;
>  }
>
> -/* RFC2861. Reset CWND after idle period longer RTO to "restart window".
> +/* RFC2581 4.1. Reset CWND after idle period longer RTO to "restart window".
>   * This is the first part of cwnd validation mechanism. */
>  static void tcp_cwnd_restart(struct sock *sk, const struct dst_entry *dst)
>  {

RFC2861 resets the cwnd like in RFC2581, but the rest of the code
implements RFC2861. So I think the current comment is fine.

>
> --
> 2.2.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP connection issues against Amazon S3

2015-01-07 Thread Yuchung Cheng
On Wed, Jan 7, 2015 at 12:37 PM, Erik Grinaker  wrote:
> On 07 Jan 2015, at 15:58, Eric Dumazet  wrote:
>> On Wed, 2015-01-07 at 13:31 +, Erik Grinaker wrote:
>>> On 06 Jan 2015, at 22:00, Yuchung Cheng  wrote:
>>>> On Tue, Jan 6, 2015 at 1:04 PM, Erik Grinaker  wrote:
>>>>>
>>>>>> On 06 Jan 2015, at 20:26, Erik Grinaker  wrote:
>>>>> This still doesn’t explain why it works with older kernels, but not newer 
>>>>> ones. I’m thinking it’s
>>>> probably some minor change, which gets amplified by the lack of SACKs
>>>> on the loadbalancer. Anyway, I’ll bring it up with Amazon.
>>>> can you post traces with the older kernels?
>>>
>>> Here is a dump using 3.11.10 against a non-SACK-enabled loadbalancer:
>>>
>>> http://abstrakt.bengler.no/tcp-issues-s3-nosack-3.11.10.pcap.bz2
>>>
>>> The transfer shows lots of DUPACKs and retransmits, but this does not
>>> seem to have as bad an effect as it did with the failing transfer we
>>> saw on newer kernels:
>>>
>>> http://abstrakt.bengler.no/tcp-issues-s3-failure.pcap.bz2
>>>
>>> One big difference, which Rick touched on earlier, is that the newer
>>> kernels keep sending TCP window updates as it’s going through the
>>> retransmits. The older kernel does not do this.
>>
>> The new kernel is the receiver : It does no retransmits.
>>
>> Increasing window in ACK packets should not prevent sender into
>> retransmitting missing packets.
>>
>> Sender is not a linux host and is very buggy IMO : If receiver
>> advertises a too big window, sender decides to not retransmit in some
>> cases.
>
> I agree. I have contacted Amazon about this, but am not too hopeful for a 
> quick fix; they have been promising SACK-support on their loadbalancers since 
> 2006, for example.
>
> That said, since this change breaks a service as popular as S3, it might be 
> worth reconsidering.
With the newer kernel and bigger receive window, the sender skips (the
already slow NewReno) fast recovery and falls back to (exp backoff)
timeout recovery. Reducing rwin to accommodate the sender's bug seems
backward to me.


>
>> You can play with /proc/sys/net/ipv4/tcp_rmem and adopt very low values
>> to work around the sender bug.
>>
>> ( Or use SO_RCVBUF in receiver application)
>
> Thanks, setting SO_RCVBUF seems like a reasonable workaround.
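[Editorial note: a minimal sketch of the SO_RCVBUF workaround suggested above. The buffer must be set before connect(), since the TCP window scale is negotiated from the buffer size during the handshake; the 64 KB value is illustrative, and Linux reports roughly double the requested value to account for bookkeeping overhead.]

```python
import socket

# Cap the receive buffer *before* connecting: the TCP window scale is
# chosen from the buffer size during the handshake, so setting it after
# connect() would not shrink the advertised window.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024)

# Linux doubles the requested value internally for bookkeeping.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective >= 64 * 1024)  # True on Linux
sock.close()
```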


Re: TCP connection issues against Amazon S3

2015-01-06 Thread Yuchung Cheng
On Tue, Jan 6, 2015 at 1:04 PM, Erik Grinaker  wrote:
>
>> On 06 Jan 2015, at 20:26, Erik Grinaker  wrote:
>>
>>>
>>> On 06 Jan 2015, at 20:13, Eric Dumazet  wrote:
>>>
>>> On Tue, 2015-01-06 at 19:42 +, Erik Grinaker wrote:
>>>
 The transfer on the functioning Netherlands server does indeed use SACKs, 
 while the Norway servers do not.

 For what it’s worth, I have made stripped down pcaps for a single failing 
 transfer as well as a single functioning transfer in the Netherlands:

 http://abstrakt.bengler.no/tcp-issues-s3-failure.pcap.bz2
 http://abstrakt.bengler.no/tcp-issues-s3-success-netherlands.pcap.bz2

>>>
>>> Although sender seems to be reluctant to retransmit, this 'failure' is
>>> caused by receiver closing the connection too soon.
>>>
>>> Are you sure you do not ask curl to setup a very small completion
>>> timer ?
>>
>> For testing, I am using Curl with a 30 second timeout. This may well be a 
>> bit short, but the point is that with the older kernel I could run thousands 
>> of requests without a single failure (generally the requests would finish 
>> within seconds), while with the newer kernel about 5% of requests will time 
>> out (the rest complete within seconds).
>>
>>> 12:41:00.738336 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 767221:768681, ack 154, win 127, length 1460
>>> 12:41:00.738346 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 736561, win 1877, length 0
>>> 12:41:05.227150 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 736561:738021, ack 154, win 127, length 1460
>>> 12:41:05.227250 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 745321, win 1882, length 0
>>> 12:41:05.278287 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 768681:770141, ack 154, win 127, length 1460
>>> 12:41:05.278354 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 745321, win 1888, length 0
>>> 12:41:05.278421 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 770141:771601, ack 154, win 127, length 1460
>>> 12:41:05.278429 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 745321, win 1894, length 0
>>> 12:41:14.257102 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 745321:746781, ack 154, win 127, length 1460
>>> 12:41:14.257154 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 746781, win 1900, length 0
>>> 12:41:14.308117 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 771601:773061, ack 154, win 127, length 1460
>>> 12:41:14.308227 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 746781, win 1905, length 0
>>> 12:41:14.308387 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 773061:774521, ack 154, win 127, length 1460
>>> 12:41:14.308397 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [.], ack 
>>> 746781, win 1911, length 0
>>>
>>> -> Here receiver sends a FIN, because application closed the socket (or 
>>> died)
>>> 12:41:23.237156 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [F.], 
>>> seq 154, ack 746781, win 1911, length 0
>>> 12:41:23.289805 IP 54.231.132.98.80 > 195.159.221.106.48837: Flags [.], seq 
>>> 746781:748241, ack 155, win 127, length 1460
>>> 12:41:23.289882 IP 195.159.221.106.48837 > 54.231.132.98.80: Flags [R], seq 
>>> 505782802, win 0, length 0
>>>
>>> Anyway, getting decent speed without SACK is going to be hard.
>>
>> Yes, I am not sure why the sender (S3) disables SACK on my Norwegian servers 
>> (across ISPs), while it enables SACK on my server in the Netherlands. They 
>> run the same kernel and configuration. I will have to look into it more 
>> closely tomorrow.
>
> It turns out the Norway and Netherlands servers were resolving different 
> loadbalancers. The ones I reached in Norway did not support SACKs, while the 
> ones in the Netherlands did. Going directly to a SACK-enabled IP fixes the 
> problem.
>
> This still doesn’t explain why it works with older kernels, but not newer
> ones. I’m thinking it’s probably some minor change, which gets amplified by
> the lack of SACKs on the loadbalancer. Anyway, I’ll bring it up with Amazon.

can you post traces with the older kernels?

>
> Many thanks for your help, everyone.


Re: TCP connection issues against Amazon S3

2015-01-06 Thread Yuchung Cheng
On Tue, Jan 6, 2015 at 11:01 AM, Erik Grinaker  wrote:
>
>> On 06 Jan 2015, at 18:33, Yuchung Cheng  wrote:
>>
>> On Tue, Jan 6, 2015 at 10:17 AM, Erik Grinaker  wrote:
>>>
>>>> On 06 Jan 2015, at 17:20, Eric Dumazet  wrote:
>>>> On Tue, 2015-01-06 at 16:11 +, Erik Grinaker wrote:
>>>>>> On 06 Jan 2015, at 16:04, Eric Dumazet  wrote:
>>>>>> On Tue, 2015-01-06 at 15:14 +, Erik Grinaker wrote:
>>>>>>> (CCing Yuchung, as his name comes up in the relevant commits)
>>>>>>>
>>>>>>> After upgrading from Ubuntu 12.04.5 to 14.04.1 we have begun seeing
>>>>>>> intermittent TCP connection hangs for HTTP image requests against
>>>>>>> Amazon S3. 3-5% of requests will suddenly stall in the middle of the
>>>>>>> transfer before timing out. We see this problem across a range of
>>>>>>> servers, in several data centres and networks, all located in Norway.
>>>>>>>
>>>>>>> A packet dump [1] shows repeated ACK retransmits for some of the
>> TCP does not retransmit ACK ... do you mean DUPACKs sent by the receiver?
>
> Ah, sorry, they are indeed DUPACKs; I thought they were the same thing.
>
>> I am trying to understand the problem. Could you confirm that it's the
>> HTTP responses sent from Amazon S3 got stalled, or HTTP requests sent
>> from the receiver (your host)?
>
> Yes. We run HTTP GET requests against S3 for images (typically a few megs in 
> size). Once in a while, the response transfer stalls about halfway through, 
> until the client (Curl) times out. The packet dump shows loads of DUPACKs 
> early on, then TCP retransmissions until the connection is closed.

Without SACK, the sender uses NewReno fast recovery and recovers one
packet per RTT. In contrast, SACK-based fast recovery can potentially
recover all lost packets in one RTT.
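[Editorial note: the difference described here can be quantified with a back-of-the-envelope model. NewReno learns about one hole per RTT via each partial ACK, while SACK lets the sender see all holes at once; the model below ignores timeouts and cwnd dynamics and is only a rough sketch.]

```python
def recovery_rtts(losses, sack):
    """Approximate RTTs spent in fast recovery.

    NewReno learns about one hole per RTT (each partial ACK reveals the
    next loss), so recovery takes roughly one RTT per lost segment.
    With SACK the sender sees all the holes at once and can retransmit
    them within about one RTT.
    """
    if losses == 0:
        return 0
    return 1 if sack else losses

# 10 segments lost on a 200 ms RTT path:
print(recovery_rtts(10, sack=False) * 200)  # 2000 ms without SACK
print(recovery_rtts(10, sack=True) * 200)   # 200 ms with SACK
```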

I still can't explain the problem seen on the newer kernel, but it has
got to be some receiver-related change, not
0f7cc9a3c2bd89b15720dbf358e9b9e62af27126, because that's a sender-side
change.

>
>> btw I suspect some middleboxes are stripping SACKOK options from your
>> SYNs (or Amazon SYN-ACKs) assuming Amazon supports SACK.
>
> That may be. I just tested this on a server in the Netherlands, and I can not 
> reproduce the problem there, while I can reproduce it from multiple locations 
> and ISPs in Norway. Would it be helpful to have a packet dump from the 
> functioning Netherlands server as well?



>
>
>>>>>>> requests. Using Ubuntu mainline kernels, we found the problem to have
>>>>>>> been introduced between 3.11.10 and 3.12.0, possibly in
>>>>>>> 0f7cc9a3c2bd89b15720dbf358e9b9e62af27126. The problem is also present
>>>>>>> in 3.18.1. Disabling tcp_window_scaling seems to solve it, but has
>>>>>>> obvious drawbacks for transfer speeds. Other sysctls do not seem to
>>>>>>> affect it.
>>>>>>>
>>>>>>> I am not sure if this is fundamentally a kernel bug or a network
>>>>>>> issue, but we did not see this problem with older kernels.
>>>>>>>
>>>>>>> [1] http://abstrakt.bengler.no/tcp-issues-s3.pcap.bz2
>>>>>>
>>>>>>
>>>>>> CC netdev
>>>>>>
>>>>>> This looks like the bug we fixed here :
>>>>>>
>>>>>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=39bb5e62867de82b269b07df900165029b928359
>>>>>
>>>>> Has that patch gone into a release? Because the problem persists with 
>>>>> 3.18.1.
>>>>
>>>> Patch is in 3.18.1 yes.
>>>>
>>>> So thats a separate issue.
>>>>
>>>> Can you confirm pcap was taken at receiver (195.159.221.106), not sender
>>>> (54.231.136.74) , and on which host is running the 'buggy kernel' ?
>>>
>>> Yes, pcap was taken on receiver (195.159.221.106).
>>>
>>>> If the sender is broken, changing the kernel on receiver wont help.
>>>>
>>>> BTW not using sack (on 54.231.132.98) is terrible for performance in
>>>> lossy environments.
>>>
>>> It may well be that the sender is broken; however, the sender is Amazon S3, 
>>> so I do not have any control over it. And in any case, the problem goes 
>>> away with 3.11.10 on receiver, but persists with 3.12.0 (or later) on 
>>> receiver, so there must be some change in 3.12.0 which has caused this to 
>>> trigger.
>>>
>>> If you are confident that the problem is with Amazon, I can get in touch 
>>> with their engineering department.
>


Re: TCP connection issues against Amazon S3

2015-01-06 Thread Yuchung Cheng
On Tue, Jan 6, 2015 at 10:17 AM, Erik Grinaker  wrote:
>
>> On 06 Jan 2015, at 17:20, Eric Dumazet  wrote:
>> On Tue, 2015-01-06 at 16:11 +, Erik Grinaker wrote:
 On 06 Jan 2015, at 16:04, Eric Dumazet  wrote:
 On Tue, 2015-01-06 at 15:14 +, Erik Grinaker wrote:
> (CCing Yuchung, as his name comes up in the relevant commits)
>
> After upgrading from Ubuntu 12.04.5 to 14.04.1 we have begun seeing
> intermittent TCP connection hangs for HTTP image requests against
> Amazon S3. 3-5% of requests will suddenly stall in the middle of the
> transfer before timing out. We see this problem across a range of
> servers, in several data centres and networks, all located in Norway.
>
> A packet dump [1] shows repeated ACK retransmits for some of the
TCP does not retransmit ACK ... do you mean DUPACKs sent by the receiver?

I am trying to understand the problem. Could you confirm that it's the
HTTP responses sent from Amazon S3 got stalled, or HTTP requests sent
from the receiver (your host)?

btw I suspect some middleboxes are stripping SACKOK options from your
SYNs (or Amazon SYN-ACKs) assuming Amazon supports SACK.



> requests. Using Ubuntu mainline kernels, we found the problem to have
> been introduced between 3.11.10 and 3.12.0, possibly in
> 0f7cc9a3c2bd89b15720dbf358e9b9e62af27126. The problem is also present
> in 3.18.1. Disabling tcp_window_scaling seems to solve it, but has
> obvious drawbacks for transfer speeds. Other sysctls do not seem to
> affect it.
>
> I am not sure if this is fundamentally a kernel bug or a network
> issue, but we did not see this problem with older kernels.
>
> [1] http://abstrakt.bengler.no/tcp-issues-s3.pcap.bz2


 CC netdev

 This looks like the bug we fixed here :

 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=39bb5e62867de82b269b07df900165029b928359
>>>
>>> Has that patch gone into a release? Because the problem persists with 
>>> 3.18.1.
>>
>> Patch is in 3.18.1 yes.
>>
>> So thats a separate issue.
>>
>> Can you confirm pcap was taken at receiver (195.159.221.106), not sender
>> (54.231.136.74) , and on which host is running the 'buggy kernel' ?
>
> Yes, pcap was taken on receiver (195.159.221.106).
>
>> If the sender is broken, changing the kernel on receiver wont help.
>>
>> BTW not using sack (on 54.231.132.98) is terrible for performance in
>> lossy environments.
>
> It may well be that the sender is broken; however, the sender is Amazon S3, 
> so I do not have any control over it. And in any case, the problem goes away 
> with 3.11.10 on receiver, but persists with 3.12.0 (or later) on receiver, so 
> there must be some change in 3.12.0 which has caused this to trigger.
>
> If you are confident that the problem is with Amazon, I can get in touch with 
> their engineering department.


Re: [PATCH] tcp: don't use timestamp from repaired skb-s to calculate RTT

2014-08-12 Thread Yuchung Cheng
On Tue, Aug 12, 2014 at 2:45 AM, Andrey Vagin  wrote:
> We don't know right timestamp for repaired skb-s. Wrong RTT estimations
> isn't good, because some congestion modules heavily depends on it.
>
> This patch adds the TCPCB_REPAIRED flag, which is included in
> TCPCB_RETRANS.
>
> Thanks to Eric for the advice how to fix this issue.
>
> This patch fixes the warning:
> [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 
> tcp_ack+0x11f5/0x1380()
> [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 
> 3.16.0-next-20140811 #1
> [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  879.568177]   c532680c 880039643d00 
> 817aa2d2
> [  879.568776]   880039643d38 8109afbd 
> 880039d6ba80
> [  879.569386]  88003a449800 2983d6bd  
> 2983d6bc
> [  879.569982] Call Trace:
> [  879.570264]  [] dump_stack+0x4d/0x66
> [  879.570599]  [] warn_slowpath_common+0x7d/0xa0
> [  879.570935]  [] warn_slowpath_null+0x1a/0x20
> [  879.571292]  [] tcp_ack+0x11f5/0x1380
> [  879.571614]  [] tcp_rcv_established+0x1ed/0x710
> [  879.571958]  [] tcp_v4_do_rcv+0x10a/0x370
> [  879.572315]  [] release_sock+0x89/0x1d0
> [  879.572642]  [] do_tcp_setsockopt.isra.36+0x120/0x860
> [  879.573000]  [] ? rcu_read_lock_held+0x6e/0x80
> [  879.573352]  [] tcp_setsockopt+0x32/0x40
> [  879.573678]  [] sock_common_setsockopt+0x14/0x20
> [  879.574031]  [] SyS_setsockopt+0x80/0xf0
> [  879.574393]  [] system_call_fastpath+0x16/0x1b
> [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
>
> Cc: Eric Dumazet 
> Cc: Pavel Emelyanov 
> Cc: "David S. Miller" 
> Signed-off-by: Andrey Vagin 
> ---
>  include/net/tcp.h |  4 +++-
>  net/ipv4/tcp.c| 16 +---
>  2 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index dafa1cb..36f5525 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -705,8 +705,10 @@ struct tcp_skb_cb {
>  #define TCPCB_SACKED_RETRANS   0x02/* SKB retransmitted*/
>  #define TCPCB_LOST 0x04/* SKB is lost  */
>  #define TCPCB_TAGBITS  0x07/* All tag bits */
> +#define TCPCB_REPAIRED 0x10/* SKB repaired (no skb_mstamp) */
>  #define TCPCB_EVER_RETRANS 0x80/* Ever retransmitted frame */
> -#define TCPCB_RETRANS  (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS)
> +#define TCPCB_RETRANS  (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
> +   TCPCB_REPAIRED)
>
> __u8ip_dsfield; /* IPv4 tos or IPv6 dsfield */
> /* 1 byte hole */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 181b70e..cb5f548 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1188,13 +1188,6 @@ new_segment:
> goto wait_for_memory;
>
> /*
> -* All packets are restored as if they have
> -* already been sent.
> -*/
> -   if (tp->repair)
> -   TCP_SKB_CB(skb)->when = 
> tcp_time_stamp;
> -
> -   /*
>  * Check whether we can use HW checksum.
>  */
> if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
> @@ -1203,6 +1196,15 @@ new_segment:
> skb_entail(sk, skb);
> copy = size_goal;
> max = size_goal;
> +
> +   /* All packets are restored as if they have
> +* already been sent. skb_mstamp isn't set to
> +* avoid wrong rtt estimation.
> +*/
> +   if (tp->repair) {
> +   TCP_SKB_CB(skb)->sacked |= 
> TCPCB_REPAIRED;
> +   TCP_SKB_CB(skb)->when = 
> tcp_time_stamp;
But this still allows RTT samples from TCP timestamp options even if
the packet is marked retransmitted/repaired in tcp_ack_update_rtt()?

> +   }
> }
>
> /* Try to append data to the end of skb. */
> --
> 1.9.3
>


Re: [PATCH 08/24] net, diet: Make TCP metrics optional

2014-05-05 Thread Yuchung Cheng
On Mon, May 5, 2014 at 3:25 PM, Andi Kleen  wrote:
> From: Andi Kleen 
>
> This is all the code that saves connection information
> between different sockets. Not really essential for
> small systems.
>
> Saves about 5.5k text
>
>textdata bss dec hex filename
>  492952   19571   13480  526003   806b3 net/built-in.o-with-metrics
>  487675   19275   13480  520430   7f0ee net/built-in.o-without-metrics
>
> Signed-off-by: Andi Kleen 
> ---
>  include/net/tcp.h  | 25 +
>  net/ipv4/Kconfig   |  6 ++
>  net/ipv4/Makefile  |  3 ++-
>  net/ipv4/sysctl_net_ipv4.c |  2 ++
>  4 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 87d8774..d741d2f 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -419,14 +419,29 @@ int tcp_child_process(struct sock *parent, struct sock 
> *child,
>   struct sk_buff *skb);
>  void tcp_enter_loss(struct sock *sk, int how);
>  void tcp_clear_retrans(struct tcp_sock *tp);
> +#ifdef CONFIG_TCP_METRICS
>  void tcp_update_metrics(struct sock *sk);
>  void tcp_init_metrics(struct sock *sk);
>  void tcp_metrics_init(void);
> +
>  bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
> bool paws_check);
>  bool tcp_remember_stamp(struct sock *sk);
>  bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw);
>  void tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst);
> +#else
> +static inline void tcp_update_metrics(struct sock *sk) {}
> +static inline void tcp_init_metrics(struct sock *sk) {}
> +static inline void tcp_metrics_init(void) {}
> +static inline bool tcp_peer_is_proven(struct request_sock *req,
> + struct dst_entry *dst,
> + bool paws_check) { return false; }
> +static inline bool tcp_remember_stamp(struct sock *sk) { return false; }
> +static inline bool
> +tcp_tw_remember_stamp(struct inet_timewait_sock *tw) { return false; }
> +static inline void
> +tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst) {}
> +#endif
>  void tcp_disable_fack(struct tcp_sock *tp);
>  void tcp_close(struct sock *sk, long timeout);
>  void tcp_init_sock(struct sock *sk);
> @@ -1296,11 +1311,21 @@ int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
>  const struct tcp_md5sig_key *key);
>
>  /* From tcp_fastopen.c */
> +#ifdef CONFIG_TCP_METRICS
>  void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
> struct tcp_fastopen_cookie *cookie, int *syn_loss,
> unsigned long *last_syn_loss);
>  void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
> struct tcp_fastopen_cookie *cookie, bool 
> syn_lost);
> +#else
> +static inline void
> +tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
> +  struct tcp_fastopen_cookie *cookie, int *syn_loss,
> +  unsigned long *last_syn_loss) {}
> +static inline void
> +tcp_fastopen_cache_set(struct sock *sk, u16 mss,
> +  struct tcp_fastopen_cookie *cookie, bool syn_lost) {}
> +#endif
>  struct tcp_fastopen_request {
> /* Fast Open cookie. Size 0 means a cookie request */
> struct tcp_fastopen_cookie  cookie;
> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 6146b1b..db2dada 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -264,6 +264,12 @@ config IP_PIMSM_V2
>   gated-5). This routing protocol is not used widely, so say N unless
>   you want to play with it.
>
> +config TCP_METRICS
> +   bool "Report TCP metrics over netlink"
> +   ---help---
> +   Enable support in TCP to save host information between different
> +  connections.
Please add that "Certain TCP features such as active TCP Fast Open
depend on this."

> +
>  config SYN_COOKIES
> bool "IP: TCP syncookie support"
> ---help---
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index 756855c..8b17b83 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -7,7 +7,7 @@ obj-y := route.o inetpeer.o protocol.o \
>  ip_output.o ip_sockglue.o inet_hashtables.o \
>  inet_timewait_sock.o inet_connection_sock.o \
>  tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
> -tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
> +tcp_minisocks.o tcp_cong.o tcp_fastopen.o \
>  tcp_offload.o datagram.o raw.o udp.o udplite.o \
>  udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
>  fib_frontend.o fib_semantics.o fib_trie.o \
> @@ -17,6 +17,7 @@ obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
>  obj-$(CONFIG_IP_PING) += ping.o
>  obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
>  obj-$(CONFIG_PROC_FS) += proc.o
> +obj-$(CONFIG_TCP_METRICS) += tcp_metrics.o
>  obj-$(CON

Re: Experimental Privacy Functions and TCP SYN Payloads

2014-02-12 Thread Yuchung Cheng
On Wed, Feb 12, 2014 at 3:35 AM, Daniel Borkmann  wrote:
> (please cc netdev)
>
> On 02/12/2014 11:25 AM, Quinn Wood wrote:
>>
>> If program on host A spoofs the source address of an outgoing IPv4 packet
>> then
>> places that address in the first 32 bits of a UDP payload, a program on
>> host B
>> that is aware of these behaviors can still reply to the program on host A.
>> [1]
>>
>> Continuing with this approach the program on host A could encrypt the UDP
>> pay-
>> load in a way that the program on host B can decrypt, and effectively
>> reduce
>> the ability of others in the wide network to passively determine who host
>> A is
>> sending transmissions to while simultaneously ensuring the program on host
>> B
>> can respond to the program on host A. [2]
>>
>> I'm uncertain how to proceed if I want to use TCP for stateful
>> connections.
>> The requirement of a handshake before data is handed off to the program
>> means
>> this approach won't work out of the box. I'm looking for any insight folks
>> may
>> have regarding this.
>>
>> My original approach to the handshake included setting one of the reserved
>> bits in the TCP header to indicate the first 32 bits of the payload were
>> the
>> real source address. However this would be reliant on SYN packets
>> containing
>> a payload. Does the Linux kernel allow this?
For 3.7+ you can use TCP Fast Open.

For a quick trial experiment, you can just set
sysctl net.ipv4.tcp_fastopen=0x603 on both end hosts and use
sendmsg(..., MSG_FASTOPEN) instead of connect() then send(). The
sendmsg() will behave as a combined connect() and send() call and
return similar errno values. accept() will return once data in the SYN
is received, instead of after the handshake completes.
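[Editorial note: a hedged sketch of the client-side usage described above, in Python, where MSG_FASTOPEN is exposed via sendto(data, flags, addr). The loopback listener and payload are illustrative; on kernels where client-side Fast Open is unavailable or disabled, the sketch falls back to an ordinary connect() plus send().]

```python
import socket

def fastopen_send(addr, payload):
    """Connect-and-send in one call using TCP Fast Open.

    sendto() with MSG_FASTOPEN acts as a combined connect()+send(): on
    first contact the kernel sends a SYN to request a cookie, and on
    later connections the payload rides in the SYN itself.  Falls back
    to a normal connect()+send() where TFO is unavailable.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.sendto(payload, socket.MSG_FASTOPEN, addr)
    except (AttributeError, OSError):
        # MSG_FASTOPEN missing, or net.ipv4.tcp_fastopen disabled.
        sock.connect(addr)
        sock.sendall(payload)
    return sock

# Minimal loopback demonstration with a throwaway listener.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = fastopen_send(srv.getsockname(), b"hello")
conn, _ = srv.accept()
data = conn.recv(5)
print(data)  # b'hello'
conn.close(); cli.close(); srv.close()
```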

>>
>> -
>>
>> [1] Barring any non store-and-forward network behavior like dropping
>> packets
>>  with questionable source addresses. Considering recent NTP-related
>> news
>>  this seems to be a not-entirely common activity :)
>> [2] This is of course reliant on both programs knowing the proper key for
>> the
>>  other.
>


Re: [PATCH net] tcp: delete unused req in tcp_synack_rtt_meas()

2013-10-24 Thread Yuchung Cheng
On Thu, Oct 24, 2013 at 2:31 AM, Weiping Pan  wrote:
> The parameter req is not used since
> 375fe02c9179(tcp: consolidate SYNACK RTT sampling)
Hi Weiping:

I just sent a bug-fix "tcp: fix SYNACK RTT estimation in Fast Open"
that will also take care of this redundant parameter. Thanks.
>
> Signed-off-by: Weiping Pan 
> ---
>  net/ipv4/tcp_input.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a16b01b..ac8781a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2871,7 +2871,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, 
> const int flag,
>  }
>
>  /* Compute time elapsed between (last) SYNACK and the ACK completing 3WHS. */
> -static void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req)
> +static void tcp_synack_rtt_meas(struct sock *sk)
>  {
> struct tcp_sock *tp = tcp_sk(sk);
> s32 seq_rtt = -1;
> @@ -5694,7 +5694,7 @@ int tcp_rcv_state_process(struct sock *sk, struct 
> sk_buff *skb,
> tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
> tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
> tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
> -   tcp_synack_rtt_meas(sk, req);
> +   tcp_synack_rtt_meas(sk);
>
> if (tp->rx_opt.tstamp_ok)
> tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
> --
> 1.7.4
>


Re: [RFC PATCH] tun: don't zeroize sock->file on detach

2012-08-08 Thread Yuchung Cheng
On Wed, Aug 8, 2012 at 5:53 AM, Stanislav Kinsbursky
 wrote:
> Hi, Dave.
> What about this patch?
>
>
> On Wed, Jul 11, 2012 at 03:48:20PM +0400, Stanislav Kinsbursky wrote:
>>
>> This is a fix for bug, introduced in 3.4 kernel by commit
>> 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d, which, among other things,
>> replaced
>> simple sock_put() by sk_release_kernel(). Below is sequence, which leads
>> to
>> oops for non-persistent devices:
>>
>> tun_chr_close()
>> tun_detach()<== tun->socket.file = NULL
>> tun_free_netdev()
>> sk_release_sock()
>> sock_release(sock->file == NULL)
>> iput(SOCK_INODE(sock))  <== dereference on NULL pointer
>>
>> This patch just removes zeroing of socket's file from __tun_detach().
>> sock_release() will do this.
>>
>> Signed-off-by: Stanislav Kinsbursky 
Acked-by: Yuchung Cheng 

I have tested this patch and it works (my kernel no longer crashes
when using tun devices).

>> ---
>>  drivers/net/tun.c |1 -
>>  1 files changed, 0 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 987aeef..c1639f3 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -185,7 +185,6 @@ static void __tun_detach(struct tun_struct *tun)
>> netif_tx_lock_bh(tun->dev);
>> netif_carrier_off(tun->dev);
>> tun->tfile = NULL;
>> -   tun->socket.file = NULL;
>> netif_tx_unlock_bh(tun->dev);
>


Re: resurrecting tcphealth

2012-07-20 Thread Yuchung Cheng
On Mon, Jul 16, 2012 at 6:03 AM, Piotr Sawuk  wrote:
> On Mo, 16.07.2012, 13:46, Eric Dumazet wrote:
>> On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
>>> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
>>> > I am not sure if the is really necessary since the most
>>> > of the stats are available elsewhere.
>>>
>>> if by "most" you mean address and port then you're right.
>>> but even the rtt reported by "ss -i" seems to differ from tcphealth.
>>
>> Thats because tcphealth is wrong, it assumes HZ=1000 ?
>>
>> tp->srtt unit is jiffies, not ms.
>
> thanks. any conversion-functions in the kernel for that?
>>
>> tcphealth is a gross hack.
>
> what would you do if you tried making it less gross?
>
> I've not found any similar functionality, in the kernel.
> I want to know an estimate for the percentage of data lost in tcp.
> and I want to know that without actually sending much packets.
> afterall I'm on the receiving end most of the time.
> percentage of duplicate packets received is nice too.
> you have any suggestions?

Counting dupacks may not be as reliable as you'd like. Say the remote
sends you 100 packets and only the first one is lost: you'll see 99
dupacks. Moreover, any small-degree reordering (<3) will generate
substantial dupacks even though the network is perfectly fine
(see Craig Partridge's "reordering is not pathological" paper).
Unfortunately the receiver can't, and does not have to, distinguish
loss from reordering. You can infer it, but that should not be the
kernel's job; there are public tools that inspect tcpdump traces to do
that.
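[Editorial note: the point about dupacks overstating loss can be seen in a few lines. The toy receiver model below cumulatively ACKs the highest in-order sequence seen, emitting one duplicate ACK per out-of-order arrival; it is a sketch of the ACK rules, not kernel code.]

```python
def count_dupacks(arrivals):
    """Count duplicate ACKs a receiver would emit for an arrival order.

    The receiver cumulatively ACKs rcv_nxt (the next expected sequence
    number); every segment arriving above a hole re-ACKs the same
    number, producing one duplicate ACK per out-of-order segment.
    """
    rcv_nxt = 0
    buffered = set()
    dupacks = 0
    for seq in arrivals:
        if seq == rcv_nxt:
            rcv_nxt += 1
            # Advance over any later segments already buffered.
            while rcv_nxt in buffered:
                rcv_nxt += 1
        else:
            buffered.add(seq)
            dupacks += 1  # the cumulative ACK repeats rcv_nxt
    return dupacks

# 100 segments sent, only segment 0 lost: 99 dupacks for a single loss.
print(count_dupacks(range(1, 100)))  # 99
# Mild reordering with zero loss also produces dupacks.
print(count_dupacks([1, 0, 2, 3]))   # 1
```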

Exposing duplicate packets received can be done via
getsockopt(TCP_INFO), although I don't know what that gives you: the
remote may be too aggressive in retransmission (not just because of a
bad RTO), or the network RTT may fluctuate.
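[Editorial note: a Linux-specific sketch of the getsockopt(TCP_INFO) approach mentioned above. It decodes only the long-stable leading fields of struct tcp_info from <linux/tcp.h>; field names and offsets are for the head of the struct, which has not changed, but this is illustrative rather than a complete parser.]

```python
import socket
import struct

def tcp_info_head(sock):
    """Decode the leading fields of Linux's struct tcp_info.

    Layout: eight u8s (state, ca_state, retransmits, probes, backoff,
    options, scale nibbles, app-limited flag) followed by u32s (rto,
    ato, snd_mss, rcv_mss, unacked, sacked, lost, retrans, ...).
    """
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    fields = struct.unpack("8B8I", raw[:40])
    return {
        "state": fields[0],
        "retransmits": fields[2],  # consecutive RTO retransmits
        "lost": fields[14],        # segments currently deemed lost
        "retrans": fields[15],     # segments currently retransmitted
    }

# Demonstrate on a fresh loopback connection (no loss expected).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
info = tcp_info_head(cli)
print(info["retransmits"], info["lost"], info["retrans"])  # 0 0 0
cli.close(); srv.close()
```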

I don't know whether tracking last_ack_sent (the latest RCV.NXT)
without knowing the ISN is useful.

btw the term-project paper cited concludes that SACK is not useful,
which is simply wrong. This makes me suspicious about how rigorous and
thoughtful its design is.
