Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread Soheil Hassas Yeganeh
On Wed, Dec 5, 2018 at 5:24 PM Eric Dumazet  wrote:
>
> If available rwnd is too small, tcp_tso_should_defer()
> can decide it is worth waiting before splitting a TSO packet.
>
> This really means we are rwnd limited.
>
> Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
> window")
> Signed-off-by: Eric Dumazet 

Acked-by: Soheil Hassas Yeganeh 

Excellent catch! Thank you for the fix, Eric!
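
As context: the rwnd-limited time that this fix corrects is exported to
applications through TCP_INFO. A minimal sketch of reading it (assuming a
4.10+ kernel whose uapi struct tcp_info carries tcpi_rwnd_limited, in
microseconds; error handling trimmed):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h> /* uapi struct tcp_info with tcpi_rwnd_limited */

/* Print how long this connection has spent limited by the peer's
 * receive window. */
static void print_rwnd_limited(int fd)
{
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
                printf("rwnd limited: %llu us\n",
                       (unsigned long long)info.tcpi_rwnd_limited);
}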

> ---
>  net/ipv4/tcp_output.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> 68b5326f73212ffe7111dd0f91e0a1246fb0ae25..3186902347584090256467d8679320666aa0257e
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2356,8 +2356,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
> int mss_now, int nonagle,
> } else {
> if (!push_one &&
> tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
> -max_segs))
> +max_segs)) {
> +   if (!is_cwnd_limited)
> +   is_rwnd_limited = true;
> break;
> +   }
> }
>
> limit = mss_now;
> --
> 2.20.0.rc2.403.gdbc3b29805-goog
>


Re: [PATCH net-next] tcp: remove loop to compute wscale

2018-11-29 Thread Soheil Hassas Yeganeh
On Thu, Nov 29, 2018 at 10:56 AM Eric Dumazet  wrote:
>
> We can remove the loop and conditional branches
> and compute wscale efficiently thanks to ilog2().
>
> Signed-off-by: Eric Dumazet 

Acked-by: Soheil Hassas Yeganeh 

Very nice, thank you, Eric!
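
As a quick standalone check that the new formula agrees with the removed
loop at the power-of-two boundaries (a userspace sketch: ilog2() below is
a stand-in for the kernel macro, built on GCC/Clang __builtin_clz, and
TCP_MAX_WSCALE is 14 as in the kernel):

#include <assert.h>
#include <stdint.h>

#define U16_MAX        65535
#define TCP_MAX_WSCALE 14

static int ilog2(uint32_t v) /* floor(log2(v)) for v >= 1 */
{
        return 31 - __builtin_clz(v);
}

int main(void)
{
        for (int k = 0; k < 32; k++) {
                uint64_t probes[] = { (1ULL << k) - 1, 1ULL << k,
                                      (1ULL << k) + 1 };

                for (int i = 0; i < 3; i++) {
                        uint64_t space = probes[i];
                        uint32_t s;
                        int loop_ws = 0, new_ws;

                        if (space < 1 || space > UINT32_MAX)
                                continue;
                        /* the loop removed by this patch */
                        for (s = space; s > U16_MAX &&
                             loop_ws < TCP_MAX_WSCALE; s >>= 1)
                                loop_ws++;
                        /* clamp_t(int, ilog2(space) - 15, 0, TCP_MAX_WSCALE) */
                        new_ws = ilog2(space) - 15;
                        if (new_ws < 0)
                                new_ws = 0;
                        if (new_ws > TCP_MAX_WSCALE)
                                new_ws = TCP_MAX_WSCALE;
                        assert(loop_ws == new_ws);
                }
        }
        return 0;
}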

> ---
>  net/ipv4/tcp_output.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> 3ae4a277991a3473f61b1f73acd5ee355cf6a5df..d3b691f3a9e896ce340ae081a0d24092024361c8
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -233,16 +233,14 @@ void tcp_select_initial_window(const struct sock *sk, 
> int __space, __u32 mss,
> if (init_rcv_wnd)
> *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
>
> -   (*rcv_wscale) = 0;
> +   *rcv_wscale = 0;
> if (wscale_ok) {
> /* Set window scaling on max possible window */
> space = max_t(u32, space, 
> sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
> space = max_t(u32, space, sysctl_rmem_max);
> space = min_t(u32, space, *window_clamp);
> -   while (space > U16_MAX && (*rcv_wscale) < TCP_MAX_WSCALE) {
> -   space >>= 1;
> -   (*rcv_wscale)++;
> -   }
> +   *rcv_wscale = clamp_t(int, ilog2(space) - 15,
> + 0, TCP_MAX_WSCALE);
> }
> /* Set the clamp no higher than max representable value */
> (*window_clamp) = min_t(__u32, U16_MAX << (*rcv_wscale), 
> *window_clamp);
> --
> 2.20.0.rc0.387.gc7a69e6b6c-goog
>


Re: [PATCH net-next] net: loopback: clear skb->tstamp before netif_rx()

2018-10-19 Thread Soheil Hassas Yeganeh
On Fri, Oct 19, 2018 at 10:11 PM Eric Dumazet  wrote:
>
> At least UDP / TCP stacks can now cook skbs with a tstamp using
> MONOTONIC base (or arbitrary values with SCM_TXTIME)
>
> Since loopback driver does not call (directly or indirectly)
> skb_scrub_packet(), we need to clear skb->tstamp so that
> net_timestamp_check() can eventually resample the time,
> using ktime_get_real().
>
> Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit 
> time.")
> Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
> Signed-off-by: Eric Dumazet 
> Cc: Willem de Bruijn 
> Cc: Soheil Hassas Yeganeh 

Acked-by: Soheil Hassas Yeganeh 

Thank you, Eric!

> ---
>  drivers/net/loopback.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index 
> a7207fa7e451311aed13cdeb100e0ea7922931bf..2df7f60fe05220c19896a251b6b15239f4b95112
>  100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -69,6 +69,10 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
> int len;
>
> skb_tx_timestamp(skb);
> +
> +   /* do not fool net_timestamp_check() with various clock bases */
> +   skb->tstamp = 0;
> +
> skb_orphan(skb);
>
> /* Before queueing this packet to netif_rx(),
> --
> 2.19.1.568.g152ad8e336-goog
>


[PATCH net-next 1/2] tcp: set recv_skip_hint when tcp_inq is less than PAGE_SIZE

2018-09-26 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

When we have less than PAGE_SIZE of data on receive queue,
we set recv_skip_hint to 0. Instead, set it to the actual
number of bytes available.

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 69c236943f56..3e17501fc1a1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1753,6 +1753,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
struct vm_area_struct *vma;
struct sk_buff *skb = NULL;
struct tcp_sock *tp;
+   int inq;
int ret;
 
if (address & (PAGE_SIZE - 1) || address != zc->address)
@@ -1773,12 +1774,15 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
tp = tcp_sk(sk);
seq = tp->copied_seq;
-   zc->length = min_t(u32, zc->length, tcp_inq(sk));
+   inq = tcp_inq(sk);
+   zc->length = min_t(u32, zc->length, inq);
zc->length &= ~(PAGE_SIZE - 1);
-
-   zap_page_range(vma, address, zc->length);
-
-   zc->recv_skip_hint = 0;
+   if (zc->length) {
+   zap_page_range(vma, address, zc->length);
+   zc->recv_skip_hint = 0;
+   } else {
+   zc->recv_skip_hint = inq;
+   }
ret = 0;
while (length + PAGE_SIZE <= zc->length) {
if (zc->recv_skip_hint < PAGE_SIZE) {
-- 
2.19.0.605.g01d371f741-goog



[PATCH net-next 2/2] tcp: adjust rcv zerocopy hints based on frag sizes

2018-09-26 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

When SKBs are coalesced, we can have SKBs with different
frag sizes: some frags are a full PAGE_SIZE and some are not.
Since recv_skip_hint is always set to the full SKB size,
it can overestimate the amount that should be read using
a normal read for coalesced packets.

Change recv_skip_hint so that it only covers the leading
frags that are not a full PAGE_SIZE. For example, with frags
of 1500, 2048, 4096 and 4096 bytes, the hint becomes
1500 + 2048 = 3548 bytes: after a normal read of that much,
zerocopy can resume on the page-sized frags that follow.

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3e17501fc1a1..cdbd423bdeb4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1805,8 +1805,17 @@ static int tcp_zerocopy_receive(struct sock *sk,
frags++;
}
}
-   if (frags->size != PAGE_SIZE || frags->page_offset)
+   if (frags->size != PAGE_SIZE || frags->page_offset) {
+   int remaining = zc->recv_skip_hint;
+
+   while (remaining && (frags->size != PAGE_SIZE ||
+frags->page_offset)) {
+   remaining -= frags->size;
+   frags++;
+   }
+   zc->recv_skip_hint -= remaining;
break;
+   }
ret = vm_insert_page(vma, address + length,
 skb_frag_page(frags));
if (ret)
-- 
2.19.0.605.g01d371f741-goog



[PATCH net] tcp: limit sk_rcvlowat by the maximum receive buffer

2018-06-08 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

The user-provided value to setsockopt(SO_RCVLOWAT) can be
larger than the maximum possible receive buffer. Such values
mute POLLIN signals on the socket, which can stall progress
indefinitely.

Limit the user-provided value to half of the maximum receive
buffer, i.e., half of sk_rcvbuf when the receive buffer size
is set by the user, or otherwise half of sysctl_tcp_rmem[2].
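
For reference, the value being clamped arrives via a plain setsockopt;
a sketch of the call this patch hardens (the oversized value is
deliberate, to show what previously could mute POLLIN):

#include <stdio.h>
#include <sys/socket.h>

/* Ask not to be woken until 64 MB are queued; before this patch, a
 * value above the maximum receive buffer could stall POLLIN forever. */
static void set_rcvlowat(int fd)
{
        int lowat = 64 * 1024 * 1024;

        if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)))
                perror("setsockopt(SO_RCVLOWAT)");
}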

Fixes: d1361840f8c5 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
Acked-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba2..141acd92e58ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1694,6 +1694,13 @@ EXPORT_SYMBOL(tcp_peek_len);
 /* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {
+   int cap;
+
+   if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+   cap = sk->sk_rcvbuf >> 1;
+   else
+   cap = sock_net(sk)->ipv4.sysctl_tcp_rmem[2] >> 1;
+   val = min(val, cap);
sk->sk_rcvlowat = val ? : 1;
 
/* Check if we need to signal EPOLLIN right now */
@@ -1702,12 +1709,7 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
return 0;
 
-   /* val comes from user space and might be close to INT_MAX */
val <<= 1;
-   if (val < 0)
-   val = INT_MAX;
-
-   val = min(val, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
if (val > sk->sk_rcvbuf) {
sk->sk_rcvbuf = val;
tcp_sk(sk)->window_clamp = tcp_win_from_space(sk, val);
-- 
2.18.0.rc1.242.g61856ae69a-goog



Re: [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events

2018-05-22 Thread Soheil Hassas Yeganeh
On Mon, May 21, 2018 at 6:08 PM, Eric Dumazet <eduma...@google.com> wrote:
> ECN signals currently forces TCP to enter quickack mode for
> up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
>
> We believe this is not needed, and only sending one immediate ack
> for the current packet should be enough.
>
> This should reduce the extra load noticed in DCTCP environments,
> after congestion events.
>
> This is part 2 of our effort to reduce pure ACK packets.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thanks for the patch!


Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode

2018-05-22 Thread Soheil Hassas Yeganeh
On Mon, May 21, 2018 at 6:08 PM, Eric Dumazet <eduma...@google.com> wrote:
> We want to add finer control of the number of ACK packets sent after
> ECN events.
>
> This patch is not changing current behavior, it only enables following
> change.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_input.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> aebb29ab2fdf2ceaa182cd11928f145a886149ff..2e970e9f4e09d966b703af2d14d521a4328eba7e
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -203,21 +203,23 @@ static void tcp_measure_rcv_mss(struct sock *sk, const 
> struct sk_buff *skb)
> }
>  }
>
> -static void tcp_incr_quickack(struct sock *sk)
> +static void tcp_incr_quickack(struct sock *sk, unsigned int max_quickacks)
>  {
> struct inet_connection_sock *icsk = inet_csk(sk);
> unsigned int quickacks = tcp_sk(sk)->rcv_wnd / (2 * 
> icsk->icsk_ack.rcv_mss);
>
> if (quickacks == 0)
> quickacks = 2;
> +   quickacks = min(quickacks, max_quickacks);
> if (quickacks > icsk->icsk_ack.quick)
> -   icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
> +   icsk->icsk_ack.quick = quickacks;
>  }
>
> -static void tcp_enter_quickack_mode(struct sock *sk)
> +static void tcp_enter_quickack_mode(struct sock *sk, unsigned int 
> max_quickacks)
>  {
> struct inet_connection_sock *icsk = inet_csk(sk);
> -   tcp_incr_quickack(sk);
> +
> +   tcp_incr_quickack(sk, max_quickacks);
> icsk->icsk_ack.pingpong = 0;
> icsk->icsk_ack.ato = TCP_ATO_MIN;
>  }
> @@ -261,7 +263,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const 
> struct sk_buff *skb)
>  * it is probably a retransmit.
>  */
> if (tp->ecn_flags & TCP_ECN_SEEN)
> -   tcp_enter_quickack_mode((struct sock *)tp);
> +   tcp_enter_quickack_mode((struct sock *)tp, 
> TCP_MAX_QUICKACKS);
> break;
> case INET_ECN_CE:
> if (tcp_ca_needs_ecn((struct sock *)tp))
> @@ -269,7 +271,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const 
> struct sk_buff *skb)
>
> if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
> /* Better not delay acks, sender can have a very low 
> cwnd */
> -   tcp_enter_quickack_mode((struct sock *)tp);
> +   tcp_enter_quickack_mode((struct sock *)tp, 
> TCP_MAX_QUICKACKS);
> tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
> }
> tp->ecn_flags |= TCP_ECN_SEEN;
> @@ -686,7 +688,7 @@ static void tcp_event_data_recv(struct sock *sk, struct 
> sk_buff *skb)
> /* The _first_ data packet received, initialize
>  * delayed ACK engine.
>  */
> -   tcp_incr_quickack(sk);
> +   tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
> icsk->icsk_ack.ato = TCP_ATO_MIN;
> } else {
> int m = now - icsk->icsk_ack.lrcvtime;
> @@ -702,7 +704,7 @@ static void tcp_event_data_recv(struct sock *sk, struct 
> sk_buff *skb)
> /* Too long gap. Apparently sender failed to
>  * restart window, so that we send ACKs quickly.
>  */
> -   tcp_incr_quickack(sk);
> +   tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
> sk_mem_reclaim(sk);
> }
> }
> @@ -4179,7 +4181,7 @@ static void tcp_send_dupack(struct sock *sk, const 
> struct sk_buff *skb)
> if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
> before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
> NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
> -   tcp_enter_quickack_mode(sk);
> +   tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
>
> if (tcp_is_sack(tp) && sock_net(sk)->ipv4.sysctl_tcp_dsack) {
> u32 end_seq = TCP_SKB_CB(skb)->end_seq;
> @@ -4706,7 +4708,7 @@ static void tcp_data_queue(struct sock *sk, struct 
> sk_buff *skb)
> tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, 
> TCP_SKB_CB(skb)->end_seq);
>
>  out_of_window:
> -   tcp_enter_quickack_mode(sk);

Re: [PATCH net-next 2/4] tcp: do not force quickack when receiving out-of-order packets

2018-05-17 Thread Soheil Hassas Yeganeh
On Thu, May 17, 2018 at 8:12 AM, Eric Dumazet <eduma...@google.com> wrote:
> As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
> acks in slow start"), TCP stacks have to consider how many packets
> are acknowledged in one single ACK, because of GRO, but also
> because of ACK compression or losses.
>
> We plan to add SACK compression in the following patch; we
> must therefore not call tcp_enter_quickack_mode().
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you, Eric!


Re: [PATCH net] tcp: restore autocorking

2018-05-03 Thread Soheil Hassas Yeganeh
On Wed, May 2, 2018 at 11:25 PM, Eric Dumazet <eduma...@google.com> wrote:
> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Michael Wenig <mwe...@vmware.com>
> Tested-by: Michael Wenig <mwe...@vmware.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you for catching and fixing this!


[PATCH V4 net-next 2/2] selftest: add test for TCP_INQ

2018-05-01 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c 
b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0..d044b29ddabcc
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soh...@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#ifndef TCP_INQ
+#define TCP_INQ 36
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+   struct sockaddr_in6 *addr6 = (void *) sockaddr;
+   struct sockaddr_in *addr4 = (void *) sockaddr;
+
+   switch (family) {
+   case PF_INET:
+   memset(addr4, 0, sizeof(*addr4));
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr4->sin_port = htons(port);
+   break;
+   case PF_INET6:
+   memset(addr6, 0, sizeof(*addr6));
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_loopback;
+   addr6->sin6_port = htons(port);
+   break;
+   default:
+   error(1, 0, "illegal family");
+   }
+}
+
+void *start_server(void *arg)
+{
+   int server_fd = (int)(unsigned long)arg;
+   struct sockaddr_in addr;
+   socklen_t addrlen = sizeof(addr);
+   char *buf;
+   int fd;
+   int r;
+
+   buf = malloc(BUF_SIZE);
+
+   for (;;) {
+   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+   if (fd == -1) {
+   perror("accept");
+   break;
+   }
+   do {
+   r = send(fd, buf, BUF_SIZE, 0);
+   } while (r < 0 && errno == EINTR);
+   if (r < 0)
+   perror("send");
+   if (r != BUF_SIZE)
+   fprintf(stderr, "can only send %d bytes\n", r);
+   /* TCP_INQ can overestimate in-queue by one byte if we send
+* the FIN packet. Sleep for 1 second, so that the client
+* likely invoked recvmsg().
+*/
+   sleep(1);
+   close(fd);
+   }
+
+   free(buf);
+   close(server_fd);
+   pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+   struct sockaddr_storage listen_addr, addr;
+   int c, one = 1, inq = -1;
+   pthread_t server_thread;
+   char cmsgbuf[CMSG_SIZE];
+   struct iovec iov[1];
+   struct cmsghdr *cm;
+   struct msghdr msg;
+   int server_fd, fd;
+   char *buf;
+
+

[PATCH V4 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-05-01 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the length of the receive queue, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
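
A minimal sketch of the receive side this enables (using the TCP_INQ and
TCP_CM_INQ definitions added by this patch; assumes TCP_INQ was enabled
once with setsockopt(fd, SOL_TCP, TCP_INQ, &one, sizeof(one)); error
handling trimmed):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_INQ
#define TCP_INQ 36
#define TCP_CM_INQ TCP_INQ
#endif

/* Read once; *inq receives the bytes still queued after this read,
 * or -1 if no TCP_CM_INQ cmsg was attached. */
static ssize_t recv_with_inq(int fd, void *data, size_t len, int *inq)
{
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = data, .iov_len = len };
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm;
        ssize_t r = recvmsg(fd, &msg, 0);

        *inq = -1;
        if (r < 0)
                return r;
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
                if (cm->cmsg_level == SOL_TCP && cm->cmsg_type == TCP_CM_INQ)
                        *inq = *(int *)CMSG_DATA(cm);
        return r;
}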

V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling recvmsg
in parallel. When that happens we lock the socket to
calculate inq.
V4 change-log:
Removed inline from a static function.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Suggested-by: David Miller <da...@davemloft.net>
---
 include/linux/tcp.h  |  2 +-
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 43 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index e9e8373b34b9d..29eb659aa77a1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -123,6 +123,9 @@ enum {
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
 #define TCP_ZEROCOPY_RECEIVE   35
+#define TCP_INQ 36 /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4028ddd14dd5a..868ed74a76a80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1889,6 +1889,22 @@ static void tcp_recv_timestamp(struct msghdr *msg, const 
struct sock *sk,
}
 }
 
+static int tcp_inq_hint(struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+   u32 copied_seq = READ_ONCE(tp->copied_seq);
+   u32 rcv_nxt = READ_ONCE(tp->rcv_nxt);
+   int inq;
+
+   inq = rcv_nxt - copied_seq;
+   if (unlikely(inq < 0 || copied_seq != READ_ONCE(tp->copied_seq))) {
+   lock_sock(sk);
+   inq = tp->rcv_nxt - tp->copied_seq;
+   release_sock(sk);
+   }
+   return inq;
+}
+
 /*
  * This routine copies from a sock struct into the user buffer.
  *
@@ -1905,13 +1921,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1926,6 +1943,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,

Re: [PATCH V3 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-05-01 Thread Soheil Hassas Yeganeh
On Tue, May 1, 2018 at 2:34 PM, David Miller <da...@davemloft.net> wrote:
> From: Soheil Hassas Yeganeh <soheil.k...@gmail.com>
> Date: Tue,  1 May 2018 10:11:27 -0400
>
>> +static inline int tcp_inq_hint(struct sock *sk)
>
> Please do not use 'inline' in foo.c files, let the compiler decide.
>
> Otherwise looks great, thanks.

Oops, sorry about this. Will send a v4.


[PATCH V3 net-next 2/2] selftest: add test for TCP_INQ

2018-05-01 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c 
b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0..d044b29ddabcc
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soh...@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#ifndef TCP_INQ
+#define TCP_INQ 36
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+   struct sockaddr_in6 *addr6 = (void *) sockaddr;
+   struct sockaddr_in *addr4 = (void *) sockaddr;
+
+   switch (family) {
+   case PF_INET:
+   memset(addr4, 0, sizeof(*addr4));
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr4->sin_port = htons(port);
+   break;
+   case PF_INET6:
+   memset(addr6, 0, sizeof(*addr6));
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_loopback;
+   addr6->sin6_port = htons(port);
+   break;
+   default:
+   error(1, 0, "illegal family");
+   }
+}
+
+void *start_server(void *arg)
+{
+   int server_fd = (int)(unsigned long)arg;
+   struct sockaddr_in addr;
+   socklen_t addrlen = sizeof(addr);
+   char *buf;
+   int fd;
+   int r;
+
+   buf = malloc(BUF_SIZE);
+
+   for (;;) {
+   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+   if (fd == -1) {
+   perror("accept");
+   break;
+   }
+   do {
+   r = send(fd, buf, BUF_SIZE, 0);
+   } while (r < 0 && errno == EINTR);
+   if (r < 0)
+   perror("send");
+   if (r != BUF_SIZE)
+   fprintf(stderr, "can only send %d bytes\n", r);
+   /* TCP_INQ can overestimate in-queue by one byte if we send
+* the FIN packet. Sleep for 1 second, so that the client
+* likely invoked recvmsg().
+*/
+   sleep(1);
+   close(fd);
+   }
+
+   free(buf);
+   close(server_fd);
+   pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+   struct sockaddr_storage listen_addr, addr;
+   int c, one = 1, inq = -1;
+   pthread_t server_thread;
+   char cmsgbuf[CMSG_SIZE];
+   struct iovec iov[1];
+   struct cmsghdr *cm;
+   struct msghdr msg;
+   int server_fd, fd;
+   char *buf;
+
+

[PATCH V3 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-05-01 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the length of the receive queue, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling
recvmsg in parallel. When that happens we lock the
socket to calculate inq.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
Suggested-by: David Miller <da...@davemloft.net>
---
 include/linux/tcp.h  |  2 +-
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 43 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index e9e8373b34b9d..29eb659aa77a1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -123,6 +123,9 @@ enum {
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
 #define TCP_ZEROCOPY_RECEIVE   35
+#define TCP_INQ 36 /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4028ddd14dd5a..ca7365db59dff 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1889,6 +1889,22 @@ static void tcp_recv_timestamp(struct msghdr *msg, const 
struct sock *sk,
}
 }
 
+static inline int tcp_inq_hint(struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+   u32 copied_seq = READ_ONCE(tp->copied_seq);
+   u32 rcv_nxt = READ_ONCE(tp->rcv_nxt);
+   int inq;
+
+   inq = rcv_nxt - copied_seq;
+   if (unlikely(inq < 0 || copied_seq != READ_ONCE(tp->copied_seq))) {
+   lock_sock(sk);
+   inq = tp->rcv_nxt - tp->copied_seq;
+   release_sock(sk);
+   }
+   return inq;
+}
+
 /*
  * This routine copies from a sock struct into the user buffer.
  *
@@ -1905,13 +1921,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1926,6 +1943,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (sk->sk_state == TCP_LISTEN)
  

Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Soheil Hassas Yeganeh
On Mon, Apr 30, 2018 at 12:10 PM, David Miller  wrote:
> From: Eric Dumazet 
> Date: Mon, 30 Apr 2018 09:01:47 -0700
>
>> TCP sockets are read by a single thread really (or synchronized
>> threads), or garbage is ensured, regardless of how the kernel
>> ensures locking while reporting "queue length"
>
> Whatever applications "typically do", we should never return
> garbage, and that is what this code is allowing to happen.
>
> Everything else in recvmsg() operates on state under the proper socket
> lock, to ensure consistency.
>
> The only reason we are releasing the socket lock first is to make sure
> the backlog is processed and we have the most update information
> available.
>
> It seems like one is striving for correctness and better accuracy, no?
> :-)
>
> Look, this can be fixed really simply.  And if you are worried about
> unbounded loops if two apps maliciously do recvmsg() in parallel,
> then don't even loop; just fall back to full socket locking and make
> the "non-typical" application pay the price:
>
> tmp1 = A;
> tmp2 = B;
> barrier();
> tmp3 = A;
> if (unlikely(tmp1 != tmp3)) {
> lock_sock(sk);
> tmp1 = A;
> tmp2 = B;
> release_sock(sk);
> }
>
> I'm seriously not applying the patch as-is, sorry.  This issue
> must be addressed somehow.

Thank you David for the suggestion. Sure, I'll send a V3 with what you
suggested above.

Thanks,
Soheil


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Soheil Hassas Yeganeh
On Mon, Apr 30, 2018 at 11:43 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On 04/30/2018 08:38 AM, David Miller wrote:
>> From: Soheil Hassas Yeganeh <soheil.k...@gmail.com>
>> Date: Fri, 27 Apr 2018 14:57:32 -0400
>>
>>> Since the socket lock is not held when calculating the size of
>>> receive queue, TCP_INQ is a hint.  For example, it can overestimate
>>> the queue size by one byte, if FIN is received.
>>
>> I think it is even worse than that.
>>
>> If another application comes in and does a recvmsg() in parallel with
>> these calculations, you could even report a negative value.

Thank you, David. In addition to Eric's point, for TCP specifically,
it is quite uncommon to have multiple threads calling recvmsg() on
the same socket in parallel, because the application is interested in
the streamed, in-sequence bytes. The exception is an application that
just discards the incoming stream or uses predefined frame sizes,
and for such cases the proposed INQ hint is not going to be useful
anyway.

Could you please let me know whether you have any other example in mind?

Thanks!
Soheil


[PATCH V2 net-next 2/2] selftest: add test for TCP_INQ

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c 
b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0..3f6a27efbe5cf
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soh...@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#ifndef TCP_INQ
+#define TCP_INQ 35
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+   struct sockaddr_in6 *addr6 = (void *) sockaddr;
+   struct sockaddr_in *addr4 = (void *) sockaddr;
+
+   switch (family) {
+   case PF_INET:
+   memset(addr4, 0, sizeof(*addr4));
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr4->sin_port = htons(port);
+   break;
+   case PF_INET6:
+   memset(addr6, 0, sizeof(*addr6));
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_loopback;
+   addr6->sin6_port = htons(port);
+   break;
+   default:
+   error(1, 0, "illegal family");
+   }
+}
+
+void *start_server(void *arg)
+{
+   int server_fd = (int)(unsigned long)arg;
+   struct sockaddr_in addr;
+   socklen_t addrlen = sizeof(addr);
+   char *buf;
+   int fd;
+   int r;
+
+   buf = malloc(BUF_SIZE);
+
+   for (;;) {
+   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+   if (fd == -1) {
+   perror("accept");
+   break;
+   }
+   do {
+   r = send(fd, buf, BUF_SIZE, 0);
+   } while (r < 0 && errno == EINTR);
+   if (r < 0)
+   perror("send");
+   if (r != BUF_SIZE)
+   fprintf(stderr, "can only send %d bytes\n", r);
+   /* TCP_INQ can overestimate in-queue by one byte if we send
+* the FIN packet. Sleep for 1 second, so that the client
+* likely invoked recvmsg().
+*/
+   sleep(1);
+   close(fd);
+   }
+
+   free(buf);
+   close(server_fd);
+   pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+   struct sockaddr_storage listen_addr, addr;
+   int c, one = 1, inq = -1;
+   pthread_t server_thread;
+   char cmsgbuf[CMSG_SIZE];
+   struct iovec iov[1];
+   struct cmsghdr *cm;
+   struct msghdr msg;
+   int server_fd, fd;
+   char *buf;
+
+

[PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the length of the receive queue, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
---
 include/linux/tcp.h  |  2 +-
 include/net/tcp.h|  8 ++++++++
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 27 +++++++++++++++++++++++----
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return max_t(int, 0,
+READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_INQ 35 /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (sk->sk_state == TCP_LISTEN)
goto out;
 
+   has_cmsg = tp->recvmsg_inq;
timeo = sock_rcvtimeo(sk, nonblock);
 
/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (TCP_SKB_CB(skb)->has_rxtstamp

Re: [PATCH net-next 2/2] net-backports: selftest: add test for TCP_INQ

2018-04-27 Thread Soheil Hassas Yeganeh
On Fri, Apr 27, 2018 at 2:50 PM, Soheil Hassas Yeganeh
<soheil.k...@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soh...@google.com>
>
> Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
> Signed-off-by: Yuchung Cheng <ych...@google.com>
> Signed-off-by: Willem de Bruijn <will...@google.com>
> Reviewed-by: Eric Dumazet <eduma...@google.com>
> Reviewed-by: Neal Cardwell <ncardw...@google.com>

Really sorry about the wrong patch subject. I'll send a V2 with the
corrected subject momentarily.

> ---
>  tools/testing/selftests/net/Makefile  |   3 +-
>  tools/testing/selftests/net/tcp_inq.c | 189 ++
>  2 files changed, 191 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/net/tcp_inq.c
>
> diff --git a/tools/testing/selftests/net/Makefile 
> b/tools/testing/selftests/net/Makefile
> index df9102ec7b7af..0a1821f8dfb18 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
> pmtu.sh udpgso.sh
>  TEST_PROGS += udpgso_bench.sh
>  TEST_GEN_FILES =  socket
>  TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
> -TEST_GEN_FILES += tcp_mmap
> +TEST_GEN_FILES += tcp_mmap tcp_inq
>  TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
>  TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
>  TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
> @@ -18,3 +18,4 @@ include ../lib.mk
>
>  $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
>  $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
> +$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
> diff --git a/tools/testing/selftests/net/tcp_inq.c 
> b/tools/testing/selftests/net/tcp_inq.c
> new file mode 100644
> index 00000..3f6a27efbe5cf
> --- /dev/null
> +++ b/tools/testing/selftests/net/tcp_inq.c
> @@ -0,0 +1,189 @@
> +/*
> + * Copyright 2018 Google Inc.
> + * Author: Soheil Hassas Yeganeh (soh...@google.com)
> + *
> + * Simple example on how to use TCP_INQ and TCP_CM_INQ.
> + *
> + * License (GPLv2):
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
> + * more details.
> + */
> +#define _GNU_SOURCE
> +
> +#include <error.h>
> +#include <netinet/in.h>
> +#include <netinet/tcp.h>
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <errno.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/socket.h>
> +#include <unistd.h>
> +
> +#ifndef TCP_INQ
> +#define TCP_INQ 35
> +#endif
> +
> +#ifndef TCP_CM_INQ
> +#define TCP_CM_INQ TCP_INQ
> +#endif
> +
> +#define BUF_SIZE 8192
> +#define CMSG_SIZE 32
> +
> +static int family = AF_INET6;
> +static socklen_t addr_len = sizeof(struct sockaddr_in6);
> +static int port = 4974;
> +
> +static void setup_loopback_addr(int family, struct sockaddr_storage 
> *sockaddr)
> +{
> +   struct sockaddr_in6 *addr6 = (void *) sockaddr;
> +   struct sockaddr_in *addr4 = (void *) sockaddr;
> +
> +   switch (family) {
> +   case PF_INET:
> +   memset(addr4, 0, sizeof(*addr4));
> +   addr4->sin_family = AF_INET;
> +   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +   addr4->sin_port = htons(port);
> +   break;
> +   case PF_INET6:
> +   memset(addr6, 0, sizeof(*addr6));
> +   addr6->sin6_family = AF_INET6;
> +   addr6->sin6_addr = in6addr_loopback;
> +   addr6->sin6_port = htons(port);
> +   break;
> +   default:
> +   error(1, 0, "illegal family");
> +   }
> +}
> +
> +void *start_server(void *arg)
> +{
> +   int server_fd = (int)(unsigned long)arg;
> +   struct sockaddr_in addr;
> +   socklen_t addrlen = sizeof(addr);
> +   char *buf;
> +   int fd;
> +   int r;
> +
> +   buf = malloc(BUF_SIZE);
> +
> +   for (;;) {
> +   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
> +   if (fd == -1) {
> +   perror("accept");
> +   break;
> +   }
> +   do {
> +   r = send(fd, buf, BUF_SIZE, 0);
> +   } while (r < 0 && errno == EINTR);
> +   if (r < 0)
> + 

[PATCH net-next 2/2] net-backports: selftest: add test for TCP_INQ

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c 
b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0..3f6a27efbe5cf
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soh...@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#ifndef TCP_INQ
+#define TCP_INQ 35
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+   struct sockaddr_in6 *addr6 = (void *) sockaddr;
+   struct sockaddr_in *addr4 = (void *) sockaddr;
+
+   switch (family) {
+   case PF_INET:
+   memset(addr4, 0, sizeof(*addr4));
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr4->sin_port = htons(port);
+   break;
+   case PF_INET6:
+   memset(addr6, 0, sizeof(*addr6));
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_loopback;
+   addr6->sin6_port = htons(port);
+   break;
+   default:
+   error(1, 0, "illegal family");
+   }
+}
+
+void *start_server(void *arg)
+{
+   int server_fd = (int)(unsigned long)arg;
+   struct sockaddr_in addr;
+   socklen_t addrlen = sizeof(addr);
+   char *buf;
+   int fd;
+   int r;
+
+   buf = malloc(BUF_SIZE);
+
+   for (;;) {
+   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+   if (fd == -1) {
+   perror("accept");
+   break;
+   }
+   do {
+   r = send(fd, buf, BUF_SIZE, 0);
+   } while (r < 0 && errno == EINTR);
+   if (r < 0)
+   perror("send");
+   if (r != BUF_SIZE)
+   fprintf(stderr, "can only send %d bytes\n", r);
+   /* TCP_INQ can overestimate in-queue by one byte if we send
+* the FIN packet. Sleep for 1 second, so that the client
+* likely invoked recvmsg().
+*/
+   sleep(1);
+   close(fd);
+   }
+
+   free(buf);
+   close(server_fd);
+   pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+   struct sockaddr_storage listen_addr, addr;
+   int c, one = 1, inq = -1;
+   pthread_t server_thread;
+   char cmsgbuf[CMSG_SIZE];
+   struct iovec iov[1];
+   struct cmsghdr *cm;
+   struct msghdr msg;
+   int server_fd, fd;
+   char *buf;
+
+

[PATCH net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the length of the receive queue, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Reviewed-by: Neal Cardwell <ncardw...@google.com>
---
 include/linux/tcp.h  |  2 +-
 include/net/tcp.h|  8 ++++++++
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 27 +++++++++++++++++++++++----
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return max_t(int, 0,
+READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_INQ 35 /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
if (sk->sk_state == TCP_LISTEN)
goto out;
 
+   has_cmsg = tp->recvmsg_inq;
timeo = sock_rcvtimeo(sk, nonblock);
 
/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
if (TCP_SKB_CB(skb)->has_rxtstamp

Re: [PATCH v3 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-26 Thread Soheil Hassas Yeganeh
On Thu, Apr 26, 2018 at 10:50 AM, Eric Dumazet <eduma...@google.com> wrote:
> When adding tcp mmap() implementation, I forgot that socket lock
> had to be taken before current->mm->mmap_sem. syzbot eventually caught
> the bug.
>
> Since we can not lock the socket in tcp mmap() handler we have to
> split the operation in two phases.
>
> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
>   This operation does not involve any TCP locking.
>
> 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
>  the transfert of pages from skbs to one VMA.
>   This operation only uses down_read(>mm->mmap_sem) after
>   holding TCP lock, thus solving the lockdep issue.
>
> This new implementation was suggested by Andy Lutomirski with great details.
>
> Benefits are :
>
> - Better scalability, in case multiple threads reuse VMAS
>(without mmap()/munmap() calls) since mmap_sem won't be write locked.
>
> - Better error recovery.
>The previous mmap() model had to provide the expected size of the
>mapping. If for some reason one part could not be mapped (partial MSS),
>the whole operation had to be aborted.
>With the tcp_zerocopy_receive struct, kernel can report how
>many bytes were successfuly mapped, and how many bytes should
>be read to skip the problematic sequence.
>
> - No more memory allocation to hold an array of page pointers.
>   16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
>
> - skbs are freed while mmap_sem has been released
>
> Following patch makes the change in tcp_mmap tool to demonstrate
> one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)
>
> Note that memcg might require additional changes.
>
> Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: syzbot <syzkal...@googlegroups.com>
> Suggested-by: Andy Lutomirski <l...@kernel.org>
> Cc: linux...@kvack.org
> Cc: Soheil Hassas Yeganeh <soh...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

getsockopt() is indeed a better choice. Thanks!
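
For readers following along, the two phases look roughly like this from
userspace (a sketch only; error handling elided, chunk_size assumed):

	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);
	void *addr;

	/* Phase 1: reserve VMA space; no TCP locking involved. */
	addr = mmap(NULL, chunk_size, PROT_READ, MAP_SHARED, fd, 0);

	/* Phase 2: map payload pages from the receive queue into the VMA. */
	zc.address = (__u64)addr;
	zc.length = chunk_size;
	zc.recv_skip_hint = 0;
	getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
	/* zc.length bytes are now readable at addr; zc.recv_skip_hint
	 * bytes (if any) must be drained with a regular recv(). */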

> ---
>  include/uapi/linux/tcp.h |   8 ++
>  net/ipv4/tcp.c   | 192 ---
>  2 files changed, 109 insertions(+), 91 deletions(-)
>
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 
> 379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c
>  100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -122,6 +122,7 @@ enum {
>  #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions 
> */
>  #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) 
> */
>  #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
> +#define TCP_ZEROCOPY_RECEIVE   35
>
>  struct tcp_repair_opt {
> __u32   opt_code;
> @@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
> __u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
>  };
>
> +/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
> +
> +struct tcp_zerocopy_receive {
> +   __u64 address;  /* in: address of mapping */
> +   __u32 length;   /* in/out: number of bytes to map/mapped */
> +   __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> +};
>  #endif /* _UAPI_LINUX_TCP_H */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> dfd090ea54ad47112fc23c61180b5bf8edd2c736..c10c4a41ad39d6f8ae472882b243c2b70c915546
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1726,118 +1726,111 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
>  }
>  EXPORT_SYMBOL(tcp_set_rcvlowat);
>
> -/* When user wants to mmap X pages, we first need to perform the mapping
> - * before freeing any skbs in receive queue, otherwise user would be unable
> - * to fallback to standard recvmsg(). This happens if some data in the
> - * requested block is not exactly fitting in a page.
> - *
> - * We only support order-0 pages for the moment.
> - * mmap() on TCP is very strict, there is no point
> - * trying to accommodate with pathological layouts.
> - */
> +static const struct vm_operations_struct tcp_vm_ops = {
> +};
> +
>  int tcp_mmap(struct file *file, struct socket *sock,
>  struct vm_area_struct *vma)
>  {
> -   unsigned long size = vma->vm_end - vma->vm_start;
> -   unsigned int nr_pages = size >> PAGE_SHIFT;
> -   struct page **pages_array = NULL;
> -   u32 seq, len, offset, nr = 0;
> -   struct sock *sk = sock->sk;
> -   const skb_frag_t *fr

Re: [PATCH v3 net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

2018-04-26 Thread Soheil Hassas Yeganeh
On Thu, Apr 26, 2018 at 10:50 AM, Eric Dumazet <eduma...@google.com> wrote:
> After prior kernel change, mmap() on TCP socket only reserves VMA.
>
> We have to use getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
> to perform the transfer of pages from skbs in the TCP receive queue into such
> VMA.
>
> struct tcp_zerocopy_receive {
> __u64 address;  /* in: address of mapping */
> __u32 length;   /* in/out: number of bytes to map/mapped */
> __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> };
>
> After a successful getsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains
> number of bytes that were mapped, and @recv_skip_hint contains number of bytes
> that should be read using conventional read()/recv()/recvmsg() system calls,
> to skip a sequence of bytes that can not be mapped, because not properly page
> aligned.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you, again!

> ---
>  tools/testing/selftests/net/tcp_mmap.c | 64 +++---
>  1 file changed, 37 insertions(+), 27 deletions(-)
>
> diff --git a/tools/testing/selftests/net/tcp_mmap.c 
> b/tools/testing/selftests/net/tcp_mmap.c
> index 
> dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..77f762780199ff1f69f9f6b3f18e72deddb69f5e
>  100644
> --- a/tools/testing/selftests/net/tcp_mmap.c
> +++ b/tools/testing/selftests/net/tcp_mmap.c
> @@ -76,9 +76,10 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #ifndef MSG_ZEROCOPY
>  #define MSG_ZEROCOPY0x400
> @@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length)
>  void *child_thread(void *arg)
>  {
> unsigned long total_mmap = 0, total = 0;
> +   struct tcp_zerocopy_receive zc;
> unsigned long delta_usec;
> int flags = MAP_SHARED;
> struct timeval t0, t1;
> char *buffer = NULL;
> -   void *oaddr = NULL;
> +   void *addr = NULL;
> double throughput;
> struct rusage ru;
> int lu, fd;
> @@ -153,41 +155,46 @@ void *child_thread(void *arg)
> perror("malloc");
> goto error;
> }
> +   if (zflg) {
> +   addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0);
> +   if (addr == (void *)-1)
> +   zflg = 0;
> +   }
> while (1) {
> struct pollfd pfd = { .fd = fd, .events = POLLIN, };
> int sub;
>
> poll(&pfd, 1, 1);
> if (zflg) {
> -   void *naddr;
> +   socklen_t zc_len = sizeof(zc);
> +   int res;
>
> -   naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 0);
> -   if (naddr == (void *)-1) {
> -   if (errno == EAGAIN) {
> -   /* That is if SO_RCVLOWAT is buggy */
> -   usleep(1000);
> -   continue;
> -   }
> -   if (errno == EINVAL) {
> -   flags = MAP_SHARED;
> -   oaddr = NULL;
> -   goto fallback;
> -   }
> -   if (errno != EIO)
> -   perror("mmap()");
> +   zc.address = (__u64)addr;
> +   zc.length = chunk_size;
> +   zc.recv_skip_hint = 0;
> +   res = getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
> +&zc, &zc_len);
> +   if (res == -1)
> break;
> +
> +   if (zc.length) {
> +   assert(zc.length <= chunk_size);
> +   total_mmap += zc.length;
> +   if (xflg)
> +   hash_zone(addr, zc.length);
> +   total += zc.length;
> }
> -   total_mmap += chunk_size;
> -   if (xflg)
> -   hash_zone(naddr, chunk_size);
> -   total += chunk_size;
>

Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive

2018-04-25 Thread Soheil Hassas Yeganeh
On Wed, Apr 25, 2018 at 5:43 PM, Eric Dumazet <eduma...@google.com> wrote:
> syzbot reported a lockdep issue caused by tcp mmap() support.
>
> I implemented Andy Lutomirski nice suggestions to resolve the
> issue and increase scalability as well.
>
> First patch is adding a new setsockopt() operation and changes mmap()
> behavior.
>
> Second patch changes tcp_mmap reference program.
>
> v2:
>  Added a missing page align of zc->length in tcp_zerocopy_receive()
>  Properly clear zc->recv_skip_hint in case user request was completed.

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you Eric for the nice redesign!

> Eric Dumazet (2):
>   tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
>   selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE
>
>  include/uapi/linux/tcp.h   |   8 ++
>  net/ipv4/tcp.c | 189 +
>  tools/testing/selftests/net/tcp_mmap.c |  63 +
>  3 files changed, 142 insertions(+), 118 deletions(-)
>
> --
> 2.17.0.441.gb46fe60e1d-goog
>


[PATCH linux-stable-4.14] tcp: clear tp->packets_out when purging write queue

2018-04-14 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Clear tp->packets_out when purging the write queue, otherwise
tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
This results in NULL pointer dereference.

Also, remove the redundant `tp->packets_out = 0` from
tcp_disconnect(), since tcp_disconnect() calls
tcp_write_queue_purge().

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Subash Abhinov Kasiviswanathan <subas...@codeaurora.org>
Reported-by: Sami Farin <hvtaifwkbgefb...@gmail.com>
Tested-by: Sami Farin <hvtaifwkbgefb...@gmail.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
---
 include/net/tcp.h | 1 +
 net/ipv4/tcp.c| 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d323d4fa742ca..fb653736f3353 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1616,6 +1616,7 @@ static inline void tcp_write_queue_purge(struct sock *sk)
sk_mem_reclaim(sk);
tcp_clear_all_retrans_hints(tcp_sk(sk));
tcp_init_send_head(sk);
+   tcp_sk(sk)->packets_out = 0;
 }
 
 static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 38b9a6276a9de..4dda8d301802e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2354,7 +2354,6 @@ int tcp_disconnect(struct sock *sk, int flags)
icsk->icsk_backoff = 0;
tp->snd_cwnd = 2;
icsk->icsk_probes_out = 0;
-   tp->packets_out = 0;
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_cnt = 0;
tp->window_clamp = 0;
-- 
2.17.0.484.g0c8726318c-goog



[PATCH net] tcp: clear tp->packets_out when purging write queue

2018-04-14 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Clear tp->packets_out when purging the write queue, otherwise
tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
This results in NULL pointer dereference.

Also, remove the redundant `tp->packets_out = 0` from
tcp_disconnect(), since tcp_disconnect() calls
tcp_write_queue_purge().

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Subash Abhinov Kasiviswanathan <subas...@codeaurora.org>
Reported-by: Sami Farin <hvtaifwkbgefb...@gmail.com>
Tested-by: Sami Farin <hvtaifwkbgefb...@gmail.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4fa3f812b9ff8..9ce1c726185eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2368,6 +2368,7 @@ void tcp_write_queue_purge(struct sock *sk)
INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
sk_mem_reclaim(sk);
tcp_clear_all_retrans_hints(tcp_sk(sk));
+   tcp_sk(sk)->packets_out = 0;
 }
 
 int tcp_disconnect(struct sock *sk, int flags)
@@ -2417,7 +2418,6 @@ int tcp_disconnect(struct sock *sk, int flags)
icsk->icsk_backoff = 0;
tp->snd_cwnd = 2;
icsk->icsk_probes_out = 0;
-   tp->packets_out = 0;
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_cnt = 0;
tp->window_clamp = 0;
-- 
2.17.0.484.g0c8726318c-goog



Re: Socket error queue with timestamping and SOF_TIMESTAMPING_OPT_CMSG

2018-04-03 Thread Soheil Hassas Yeganeh
On Tue, Apr 3, 2018 at 11:19 AM Miroslav Lichvar  wrote:
>
> I came across an interesting issue with error messages in sockets with
> enabled timestamping using the SOF_TIMESTAMPING_OPT_CMSG option. When
> the socket is connected and there is an error (e.g. due to destination
> unreachable ICMP), select() indicates there is an exception on the
> socket, but recvmsg() reading from the error queue returns with EAGAIN
> and the application gets stuck in an infinite loop.
>
> Some observations:
> - it happens on both AF_INET and AF_INET6 SOCK_DGRAM sockets
> - enabling the IP_RECVERR option avoids getting EAGAIN
> - using recvmmsg() instead of recvmsg() avoids getting EAGAIN
>   (that is why I didn't notice it earlier)
> - disabling TX timestamping doesn't prevent the socket from having an
>   exception
> - reading from the non-error queue stops the loop
>
> Is this a bug?

POLLERR and select exceptions on an FD indicate that either (a) there
is a message on the error queue, or (b) sk_err is set. Because of (b),
it's not sufficient to only call recvmsg(MSG_ERRQUEUE). For TCP, we
must always read or write from the socket after POLLERR. For UDP,
getsockopt(SO_ERROR) would return the actual socket error and will
clear the sk_err value.

recvmmsg had a bug and was checking sk_err when we read from error
queue, before actually reading the error queue. That was really a bug
in recvmmsg which should be fixed after:
https://patchwork.ozlabs.org/patch/878861/

> It looks to me like SOF_TIMESTAMPING_OPT_CMSG implicitly, but only
> partially, enables IP_RECVERR. Are applications required to use
> IP_RECVERR in this case? My expectation was that without IP_RECVERR
> the error queue would only have messages with transmit timestamps, and
> nothing would change with reporting of real errors.

No, IP_RECVERR and SOF_TIMESTAMPING_OPT_CMSG are completely
orthogonal. When we have IP_RECVERR, the ICMP packet is simply added
to the error queue. Without IP_RECVERR, an ICMP error packet results
in setting the sk_err and the ICMP packet is discarded. This behavior is
completely unrelated to SOF_TIMESTAMPING_OPT_CMSG, and sk_err is
always set in response to ICMP packets regardless of transmit
timestamps.

Since you're only checking the error queue upon a socket
error/exception, you will need to set IP_RECVERR to make sure the ICMP
packet is kept on the error queue. You wouldn't need that if you also
check getsockopt(SO_ERROR).
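
To make that concrete, a defensive event loop would look roughly like
this (a sketch only; the msghdr emsg setup is elided):

	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int err;
	socklen_t errlen = sizeof(err);

	poll(&pfd, 1, -1);
	if (pfd.revents & POLLERR) {
		/* Drain queued messages, if any; ICMP errors are only
		 * queued when IP_RECVERR is enabled. */
		while (recvmsg(fd, &emsg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0)
			;
		/* Clear sk_err for errors that were never queued. */
		getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
	}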

> Also, from the
> documentation I had an impression that SOF_TIMESTAMPING_OPT_CMSG is a
> no-op on AF_INET6 sockets.

No, it should work for both v4 and v6.

Thanks,
Soheil

> --
> Miroslav Lichvar


Re: 4.14.2[6-7] tcp_push NULL pointer

2018-03-19 Thread Soheil Hassas Yeganeh
On Mon, Mar 19, 2018 at 10:16 AM Eric Dumazet wrote:



> On 03/19/2018 07:03 AM, David Miller wrote:
> > From: Eric Dumazet 
> > Date: Mon, 19 Mar 2018 05:17:37 -0700
> >
> >> We have sent a fix last week, I am not sure if David took it.
> >>
> >> https://patchwork.ozlabs.org/patch/886324/
> >
> > I thought I submitted that in my last round of -stable submissions,
> > but I have re-added this to my stable queue and will make sure it
> > really gets there if not.
> >

> Thanks David !

Thanks very much, David!


[PATCH linux-stable-4.14] tcp: reset sk_send_head in tcp_write_queue_purge

2018-03-15 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

tcp_write_queue_purge clears all the SKBs in the write queue
but does not reset the sk_send_head. As a result, we can have
a NULL pointer dereference anywhere that we use tcp_send_head
instead of the tcp_write_queue_tail.

For example, after a27fd7a8ed38 (tcp: purge write queue upon RST),
we can purge the write queue on RST. Prior to
75c119afe14f (tcp: implement rb-tree based retransmit queue),
tcp_push will only check tcp_send_head and then accesses
tcp_write_queue_tail to send the actual SKB. As a result, it will
dereference a NULL pointer.

This has been reported twice for 4.14 where we don't have
75c119afe14f:

By Timofey Titovets:

[  422.081094] BUG: unable to handle kernel NULL pointer dereference
at 0038
[  422.081254] IP: tcp_push+0x42/0x110
[  422.081314] PGD 0 P4D 0
[  422.081364] Oops: 0002 [#1] SMP PTI

By Yongjian Xu:

BUG: unable to handle kernel NULL pointer dereference at 0038
IP: tcp_push+0x48/0x120
PGD 8007ff77b067 P4D 8007ff77b067 PUD 7fd989067 PMD 0
Oops: 0002 [#18] SMP PTI
Modules linked in: tcp_diag inet_diag tcp_bbr sch_fq iTCO_wdt
iTCO_vendor_support pcspkr ixgbe mdio i2c_i801 lpc_ich joydev input_leds shpchp
e1000e igb dca ptp pps_core hwmon mei_me mei ipmi_si ipmi_msghandler sg ses
scsi_transport_sas enclosure ext4 jbd2 mbcache sd_mod ahci libahci megaraid_sas
wmi ast ttm dm_mirror dm_region_hash dm_log dm_mod dax
CPU: 6 PID: 14156 Comm: [ET_NET 6] Tainted: G D 4.14.26-1.el6.x86_64 #1
Hardware name: LENOVO ThinkServer RD440 /ThinkServer RD440, BIOS A0TS80A
09/22/2014
task: 8807d78d8140 task.stack: c9000e944000
RIP: 0010:tcp_push+0x48/0x120
RSP: 0018:c9000e947a88 EFLAGS: 00010246
RAX: 05b4 RBX: 880f7cce9c00 RCX: 
RDX:  RSI: 0040 RDI: 8807d00f5000
RBP: c9000e947aa8 R08: 1c84 R09: 
R10: 8807d00f5158 R11:  R12: 8807d00f5000
R13: 0020 R14: 000256d4 R15: 
FS: 7f5916de9700() GS:88107fd0() knlGS:
CS: 0010 DS:  ES:  CR0: 80050033
CR2: 0038 CR3: 0007f8226004 CR4: 001606e0
Call Trace:
tcp_sendmsg_locked+0x33d/0xe50
tcp_sendmsg+0x37/0x60
inet_sendmsg+0x39/0xc0
sock_sendmsg+0x49/0x60
sock_write_iter+0xb6/0x100
do_iter_readv_writev+0xec/0x130
? rw_verify_area+0x49/0xb0
do_iter_write+0x97/0xd0
vfs_writev+0x7e/0xe0
? __wake_up_common_lock+0x80/0xa0
? __fget_light+0x2c/0x70
? __do_page_fault+0x1e7/0x530
do_writev+0x60/0xf0
? inet_shutdown+0xac/0x110
SyS_writev+0x10/0x20
do_syscall_64+0x6f/0x140
? prepare_exit_to_usermode+0x8b/0xa0
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x3135ce0c57
RSP: 002b:7f5916de4b00 EFLAGS: 0293 ORIG_RAX: 0014
RAX: ffda RBX:  RCX: 003135ce0c57
RDX: 0002 RSI: 7f5916de4b90 RDI: 606f
RBP:  R08:  R09: 7f5916de8c38
R10:  R11: 0293 R12: 000464cc
R13: 7f5916de8c30 R14: 7f58d8bef080 R15: 0002
Code: 48 8b 97 60 01 00 00 4c 8d 97 58 01 00 00 41 b9 00 00 00 00 41 89 f3 4c 39
d2 49 0f 44 d1 41 81 e3 00 80 00 00 0f 85 b0 00 00 00 <80> 4a 38 08 44 8b 8f 74
06 00 00 44 89 8f 7c 06 00 00 83 e6 01
RIP: tcp_push+0x48/0x120 RSP: c9000e947a88
CR2: 0038
---[ end trace 8d545c2e93515549 ]---

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Timofey Titovets <nefelim...@gmail.com>
Reported-by: Yongjian Xu <yongjian...@gmail.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
---
 include/net/tcp.h | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0a13574134b8b..d323d4fa742ca 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1600,6 +1600,11 @@ enum tcp_chrono {
 void tcp_chrono_start(struct sock *sk, const enum tcp_chrono type);
 void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type);
 
+static inline void tcp_init_send_head(struct sock *sk)
+{
+   sk->sk_send_head = NULL;
+}
+
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
@@ -1610,6 +1615,7 @@ static inline void tcp_write_queue_purge(struct sock *sk)
sk_wmem_free_skb(sk, skb);
sk_mem_reclaim(sk);
tcp_clear_all_retrans_hints(tcp_sk(sk));
+   tcp_init_send_head(sk);
 }
 
 static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk)
@@ -1672,11 +1678,6 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unlinked)
tcp_sk(sk)->highest_sack = NULL;
 }
 
-static inline void tcp_init_send_head(struct sock *sk)
-{
-   sk->sk_send_head = NULL;
-}
-
 static inline void __tcp_add_write_que

Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors

2018-03-14 Thread Soheil Hassas Yeganeh
On Wed, Mar 14, 2018 at 12:32 PM Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote:

> On Tue, Mar 13, 2018 at 4:35 PM, Vinicius Costa Gomes
>  wrote:
> > Hi,
> >
> > Changes from the RFC:
> >  - tweaked commit messages;
> >
> > Original cover letter:
> >
> > This is actually a "bug report"-RFC instead of the more usual "new
> > feature"-RFC.
> >
> > We are developing an application that uses TX hardware timestamping to
> > make some measurements, and during development Randy Witt initially
> > reported that the application poll() never unblocked when TX hardware
> > timestamping was enabled.
> >
> > After some investigation, it turned out the problem wasn't only
> > exclusive to hardware timestamping, and could be reproduced with
> > software timestamping.
> >
> > Applying patch (1), and running txtimestamp like this, for example:
> >
> > $ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F
> >
> > ('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
> > packets for each test, '-D' to remove the delay between packets, '-l
> > 1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
> > wait forever)
> >
> > will cause the application to become stuck in the poll() call in most
> > of the times. (Note: I couldn't reproduce the issue running against an
> > address that is routed through loopback.)
> >
> > Another interesting fact is that if the POLLIN event is added to the
> > poll() .events, poll() no longer becomes stuck,

> The process has registered interest only in POLLIN, which the call to
> sk_data_read (sock_def_readable) will trigger.

> > and more interestingly
> > the returned event in .revents is only POLLERR.

> datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.

> > After a few debugging sessions, we got to 'sock_queue_err_skb()' and
> > how it notifies applications of the error just enqueued. Changing it
> > to use 'sk->sk_error_report()', fixes the issue for hardware and
> > software timestamping. That is patch (2).
> >
> > The "solution" proposed in patch (2) looks like too big a hammer,

> It looks fine to me. POLLERR is returned regardless of the mask a
> process sets up in pollfd.events. So waking with sk_error_report
> will fix this while still waking callers waiting on POLLIN.

> Note that on sock_dequeue_err_skb, if another notification (of the
> right kind) is waiting, sk_error_report is already called instead of
> sk_data_ready.

Thank you for the fix. It looks fine to me too, because we already only arm
POLLERR for sock_dequeue_err_skb, and POLLERR is always returned when error
queue is not empty:
   if (... !skb_queue_empty(&sk->sk_error_queue))
  mask |= POLLERR ...

> This should perhaps go to net, instead of net-next (though not the test).

+1 I think the fix should go to net.

> If resending, a small nit in the test: please keep the alphabetical
> order in getopt. The filepath also looks a bit fishy, but git am applied
> the mbox from patchwork without issue.


[PATCH net] tcp: purge write queue upon aborting the connection

2018-03-06 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

When the connection is aborted, there is no point in
keeping the packets on the write queue until the connection
is closed.

Similar to a27fd7a8ed38 ('tcp: purge write queue upon RST'),
this is essential for a correct MSG_ZEROCOPY implementation,
because userspace cannot call close(fd) before receiving
zerocopy signals even when the connection is aborted.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
---
 net/ipv4/tcp.c   | 1 +
 net/ipv4/tcp_timer.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 48636aee23c3..8b8059b7af4d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3566,6 +3566,7 @@ int tcp_abort(struct sock *sk, int err)
 
bh_unlock_sock(sk);
local_bh_enable();
+   tcp_write_queue_purge(sk);
release_sock(sk);
return 0;
 }
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 71fc60f1b326..f7d944855f8e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -34,6 +34,7 @@ static void tcp_write_err(struct sock *sk)
sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
sk->sk_error_report(sk);
 
+   tcp_write_queue_purge(sk);
tcp_done(sk);
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONTIMEOUT);
 }
-- 
2.16.2.395.g2e18187dfd-goog



Re: [PATCH net-next 0/2] tcp_bbr: more GSO work

2018-02-28 Thread Soheil Hassas Yeganeh
On Wed, Feb 28, 2018 at 5:40 PM Eric Dumazet <eduma...@google.com> wrote:

> Playing with r8152 USB 1Gbit NIC, on both USB2 and USB3 slots,
> I found that BBR was performing poorly, because of TSO being limited to 16KB

> This patch series makes sure BBR is not underestimating the number
> of packets that are needed to fill the pipe when a device has
> suboptimal TSO limits.

> Eric Dumazet (2):
>tcp_bbr: better deal with suboptimal GSO (II)
>tcp_bbr: remove bbr->tso_segs_goal

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you, Eric!

>   include/net/tcp.h |  6 ++
>   net/ipv4/tcp_bbr.c| 33 -
>   net/ipv4/tcp_output.c | 15 ---
>   3 files changed, 26 insertions(+), 28 deletions(-)

> --
> 2.16.2.395.g2e18187dfd-goog


[PATCH net] tcp: purge write queue upon RST

2018-02-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

When the connection is reset, there is no point in
keeping the packets on the write queue until the connection
is closed.

RFC 793 (page 70) and RFC 793-bis (page 64) both suggest
purging the write queue upon RST:
https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-07

Moreover, this is essential for a correct MSG_ZEROCOPY
implementation, because userspace cannot call close(fd)
before receiving zerocopy signals even when the connection
is reset.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
---
 net/ipv4/tcp_input.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 06b9c4765f42..b17fac2629c3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3998,6 +3998,7 @@ void tcp_reset(struct sock *sk)
/* This barrier is coupled with smp_rmb() in tcp_poll() */
smp_wmb();
 
+   tcp_write_queue_purge(sk);
tcp_done(sk);
 
if (!sock_flag(sk, SOCK_DEAD))
-- 
2.16.2.395.g2e18187dfd-goog



[PATCH net-next] socket: skip checking sk_err for recvmmsg(MSG_ERRQUEUE)

2018-02-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

recvmmsg does not call ___sys_recvmsg when sk_err is set.
That is fine for normal reads but, for MSG_ERRQUEUE, recvmmsg
should always call ___sys_recvmsg regardless of sk->sk_err to
be able to clear error queue. Otherwise, users are not able to
drain the error queue using recvmmsg.
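
For reference, the batch-draining pattern this enables looks roughly
like the following (a sketch; the msgs/vlen setup is assumed, and
process_error_cmsgs is a hypothetical helper):

	int i, n;

	n = recvmmsg(fd, msgs, vlen, MSG_ERRQUEUE | MSG_DONTWAIT, NULL);
	for (i = 0; i < n; i++)
		process_error_cmsgs(&msgs[i].msg_hdr);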

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Reviewed-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
---
 net/socket.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 645d32b4872c..d9a1ac233b35 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2289,10 +2289,12 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
if (!sock)
return err;
 
-   err = sock_error(sock->sk);
-   if (err) {
-   datagrams = err;
-   goto out_put;
+   if (likely(!(flags & MSG_ERRQUEUE))) {
+   err = sock_error(sock->sk);
+   if (err) {
+   datagrams = err;
+   goto out_put;
+   }
}
 
entry = mmsg;
-- 
2.16.2.395.g2e18187dfd-goog



Re: [PATCH net] tcp_bbr: better deal with suboptimal GSO

2018-02-21 Thread Soheil Hassas Yeganeh
On Wed, Feb 21, 2018 at 10:14 AM Neal Cardwell <ncardw...@google.com> wrote:

> On Wed, Feb 21, 2018 at 9:43 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> >
> > From: Eric Dumazet <eduma...@google.com>
> >
> > BBR uses tcp_tso_autosize() in an attempt to probe what would be the
> > burst sizes and to adjust cwnd in bbr_target_cwnd() with following
> > gold formula :
> >
> > /* Allow enough full-sized skbs in flight to utilize end systems. */
> > cwnd += 3 * bbr->tso_segs_goal;
> >
> > But GSO can be lacking or be constrained to very small
> > units (ip link set dev ... gso_max_segs 2)
> >
> > What we really want is to have enough packets in flight so that both
> > GSO and GRO are efficient.
> >
> > So in the case GSO is off or downgraded, we still want to have the same
> > number of packets in flight as if GSO/TSO was fully operational, so
> > that GRO can hopefully be working efficiently.
> >
> > To fix this issue, we make tcp_tso_autosize() unaware of
> > sk->sk_gso_max_segs
> >
> > Only tcp_tso_segs() has to enforce the gso_max_segs limit.
> >
> > Tested:
> >
> > ethtool -K eth0 tso off gso off
> > tc qd replace dev eth0 root pfifo_fast
> >
> > Before patch:
> > for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> > 691  (ss -temoi shows cwnd is stuck around 6 )
> > 667
> > 651
> > 631
> > 517
> >
> > After patch :
> > # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
> >1733 (ss -temoi shows cwnd is around 386 )
> >1778
> >1746
> >1781
> >1718
> >
> > Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> > Signed-off-by: Eric Dumazet <eduma...@google.com>
> > Reported-by: Oleksandr Natalenko <oleksa...@natalenko.name>
> > ---
> >  net/ipv4/tcp_output.c |9 +
> >  1 file changed, 5 insertions(+), 4 deletions(-)

> Acked-by: Neal Cardwell <ncardw...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you Eric for the nice patch!
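
As a back-of-the-envelope illustration of the quoted headroom formula
(numbers are illustrative, not measured):

	/* If GSO caps the goal (e.g. ip link set dev ... gso_max_segs 2): */
	int tso_segs_goal = 2;
	int headroom = 3 * tso_segs_goal;   /* = 6: cwnd starved, as seen above */

	/* Untethered from the device, the goal reflects ideal TSO bursts: */
	tso_segs_goal = 44;                 /* e.g. ~64KB / 1448-byte MSS */
	headroom = 3 * tso_segs_goal;       /* = 132 segments of headroom */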


Re: [PATCH net-next 0/6] tcp: remove non GSO code

2018-02-19 Thread Soheil Hassas Yeganeh
On Mon, Feb 19, 2018 at 2:56 PM Eric Dumazet <eduma...@google.com> wrote:
> Switching TCP to GSO mode, relying on core networking layers
> to perform eventual adaptation for dumb devices was overdue.

> 1) Most TCP developments are done with TSO in mind.
> 2) Less high-resolution timers needs to be armed for TCP-pacing
> 3) GSO can benefit of xmit_more hint
> 4) Receiver GRO is more effective (as if TSO was used for real on sender)
>  -> less ACK packets and overhead.
> 5) Write queues have less overhead (one skb holds about 64KB of payload)
> 6) SACK coalescing just works. (no payload in skb->head)
> 7) rtx rb-tree contains less packets, SACK is cheaper.
> 8) Removal of legacy code. Less maintenance hassles.

> Note that I have left the sendpage/zerocopy paths, but they probably can
> benefit from the same strategy.

> Thanks to Oleksandr Natalenko for reporting a performance issue for
BBR/fq_codel,
> which was the main reason I worked on this patch series.

> Eric Dumazet (6):
> tcp: switch to GSO being always on
> tcp: remove sk_can_gso() use
> tcp: remove sk_check_csum_caps()
> tcp: tcp_sendmsg() only deals with CHECKSUM_PARTIAL
> tcp: remove dead code from tcp_set_skb_tso_segs()
> tcp: remove dead code after CHECKSUM_PARTIAL adoption

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Very nice patch-series! Thank you, Eric!

>include/net/sock.h| 10 +---
>net/core/sock.c   |  2 +-
>net/ipv4/tcp.c| 57 ---
>net/ipv4/tcp_input.c  |  3 ---
>net/ipv4/tcp_ipv4.c   | 13 +++---
>net/ipv4/tcp_output.c | 40 +-
>6 files changed, 26 insertions(+), 99 deletions(-)
> --
> 2.16.1.291.g4437f3f132-goog


[PATCH net-next 2/2] net: revert "Update RFS target at poll for tcp/udp"

2018-01-03 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

In multi-threaded processes, one common architecture is to have
one (or a small number of) threads polling sockets, and a
considerably larger pool of threads reading from and writing to the
sockets. When we set the RPS core on tcp_poll() or udp_poll() we essentially
steer all packets of all the polled FDs to one (or a small number of)
cores, creating a bottleneck and/or RPS misprediction.

Another common architecture is to shard FDs among threads pinned
to cores. In such a setting, setting RPS core in tcp_poll() and
udp_poll() is redundant because the RFS core is correctly
set in recvmsg and sendmsg.

Thus, revert the following commit:
c3f1dbaf6e28 ("net: Update RFS target at poll for tcp/udp").

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
---
 net/ipv4/tcp.c | 2 --
 net/ipv4/udp.c | 2 --
 2 files changed, 4 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7ac583a2b9fe..f68cb33d50d1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -498,8 +498,6 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
const struct tcp_sock *tp = tcp_sk(sk);
int state;
 
-   sock_rps_record_flow(sk);
-
sock_poll_wait(file, sk_sleep(sk), wait);
 
state = inet_sk_state_load(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e9c0d1e1772e..db72619e07e4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2490,8 +2490,6 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
if (!skb_queue_empty(&udp_sk(sk)->reader_queue))
mask |= POLLIN | POLLRDNORM;
 
-   sock_rps_record_flow(sk);
-
/* Check for false positives due to checksum errors */
if ((mask & POLLRDNORM) && !(file->f_flags & O_NONBLOCK) &&
!(sk->sk_shutdown & RCV_SHUTDOWN) && first_packet_length(sk) == -1)
-- 
2.16.0.rc0.223.g4a4ac83678-goog



[PATCH net-next 1/2] ip: do not set RFS core on error queue reads

2018-01-03 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

We should only record RPS on normal reads and writes.
In single threaded processes, all calls record the same state. In
multi-threaded processes where a separate thread processes
errors, the RFS table mispredicts.

Note that, when CONFIG_RPS is disabled, sock_rps_record_flow
is a noop and no branch is added as a result of this patch.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
---
 net/ipv4/af_inet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bab98a4fedad..54cccdd8b1e3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -790,7 +790,8 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
int addr_len = 0;
int err;
 
-   sock_rps_record_flow(sk);
+   if (likely(!(flags & MSG_ERRQUEUE)))
+   sock_rps_record_flow(sk);
 
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
   flags & ~MSG_DONTWAIT, &addr_len);
-- 
2.16.0.rc0.223.g4a4ac83678-goog



Re: [PATCH net] tcp: refresh tcp_mstamp from timers callbacks

2017-12-12 Thread Soheil Hassas Yeganeh
On Tue, Dec 12, 2017 at 9:26 PM, Neal Cardwell <ncardw...@google.com> wrote:
> On Tue, Dec 12, 2017 at 9:22 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> From: Eric Dumazet <eduma...@google.com>
>>
>> Only the retransmit timer currently refreshes tcp_mstamp
>>
>> We should do the same for delayed acks and keepalives.
>>
>> Even if RFC 7323 does not request it, this is consistent to what linux
>> did in the past, when TS values were based on jiffies.
>>
>> Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
>> Signed-off-by: Eric Dumazet <eduma...@google.com>
>> Cc: Soheil Hassas Yeganeh <soh...@google.com>
>> Cc: Mike Maloney <malo...@google.com>
>> Cc: Neal Cardwell <ncardw...@google.com>
>> ---
>
> Acked-by: Neal Cardwell <ncardw...@google.com>
>
> Thanks, Eric!
>
> neal

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

This is a very nice catch! Thank you Eric!


Re: [PATCH net] tcp: when scheduling TLP, time of RTO should account for current ACK

2017-11-17 Thread Soheil Hassas Yeganeh
On Fri, Nov 17, 2017 at 9:06 PM, Neal Cardwell <ncardw...@google.com> wrote:
>
> Fix the TLP scheduling logic so that when scheduling a TLP probe, we
> ensure that the estimated time at which an RTO would fire accounts for
> the fact that ACKs indicating forward progress should push back RTO
> times.
>
> After the following fix:
>
> df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
>
> we had an unintentional behavior change in the following kind of
> scenario: suppose the RTT variance has been very low recently. Then
> suppose we send out a flight of N packets and our RTT is 100ms:
>
> t=0: send a flight of N packets
> t=100ms: receive an ACK for N-1 packets
>
> The response before df92c8394e6e that was:
>   -> schedule a TLP for now + RTO_interval
>
> The response after df92c8394e6e is:
>   -> schedule a TLP for t=0 + RTO_interval
>
> Since RTO_interval = srtt + RTT_variance, this means that we have
> scheduled a TLP timer at a point in the future that only accounts for
> RTT_variance. If the RTT_variance term is small, this means that the
> timer fires soon.
>
> Before df92c8394e6e this would not happen, because in that code, when
> we receive an ACK for a prefix of flight, we did:
>
> 1) Near the top of tcp_ack(), switch from TLP timer to RTO
>at write_queue_head->packet_tx_time + RTO_interval:
> if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>tcp_rearm_rto(sk);
>
> 2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
> if (flag & FLAG_ACKED) {
>tcp_rearm_rto(sk);
>
> 3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
>to TLP at now + RTO_interval:
> if (icsk->icsk_pending == ICSK_TIME_RETRANS)
>tcp_schedule_loss_probe(sk);
>
> In df92c8394e6e we removed that 3-phase dance, and instead directly
> set the TLP timer once: we set the TLP timer in cases like this to
> write_queue_head->packet_tx_time + RTO_interval. So if the RTT
> variance is small, then this means that this is setting the TLP timer
> to fire quite soon. This means if the ACK for the tail of the flight
> takes longer than an RTT to arrive (often due to delayed ACKs), then
> the TLP timer fires too quickly.
>
> Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data 
> ACKed/SACKed")
> Signed-off-by: Neal Cardwell <ncardw...@google.com>
> Signed-off-by: Yuchung Cheng <ych...@google.com>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Nice fix. Thank you, Neal!
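
For concreteness, plugging the scenario's numbers into the quoted
formula (assuming srtt = 100ms and a small rttvar of 4ms):

	RTO_interval = srtt + RTT_variance = 100 + 4 = 104ms

	before fix: TLP armed at t0 + 104ms, i.e. only 4ms after the
	            ACK that arrived at t=100ms -- far too early for a
	            (possibly delayed) tail ACK
	after fix:  TLP armed relative to the ACK, i.e. t=100ms + 104ms
	            = t=204ms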


Re: [PATCH net-next] tcp: tcp_fragment() should not assume rtx skbs

2017-11-02 Thread Soheil Hassas Yeganeh
On Thu, Nov 2, 2017 at 9:16 PM, Neal Cardwell <ncardw...@google.com> wrote:
> On Thu, Nov 2, 2017 at 9:10 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> From: Eric Dumazet <eduma...@google.com>
>>
>> While stress testing MTU probing, we had crashes in list_del() that we 
>> root-caused
>> to the fact that tcp_fragment() is unconditionally inserting the freshly 
>> allocated
>> skb into tsorted_sent_queue list.
>>
>> But this list is supposed to contain skbs that were sent.
>> This was mostly harmless until MTU probing was enabled.
>>
>> Fortunately we can use the tcp_queue enum added later (but in same linux 
>> version)
>> for rtx-rb-tree to fix the bug.
>>
>> Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK 
>> recovery")
>> Signed-off-by: Eric Dumazet <eduma...@google.com>
>
> Acked-by: Neal Cardwell <ncardw...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Nice! Thank you, Eric!


Re: [PATCH v2 net-next] net: sk_buff rbnode reorg

2017-09-19 Thread Soheil Hassas Yeganeh
On Tue, Sep 19, 2017 at 8:14 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
>
> Current uses (TCP receive ofo queue and netem) need to save/restore
> tstamp, while skb->dev is either NULL (TCP) or a constant for a given
> queue (netem).
>
> Since we plan to use an RB tree for the TCP retransmit queue to speed up SACK
> processing with large BDP, this patch exchanges skb->dev and
> skb->tstamp.
>
> This saves some overhead in both TCP and netem.
>
> v2: removes the swtstamp field from struct tcp_skb_cb
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Wei Wang <wei...@google.com>
> Cc: Willem de Bruijn <will...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Very nice!


Re: [PATCH net] tcp: fix data delivery rate

2017-09-15 Thread Soheil Hassas Yeganeh
On Fri, Sep 15, 2017 at 7:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> Now skb->skb_mstamp is updated later, we also need to call
> tcp_rate_skb_sent() after the update is done.
>
> Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Nice catch!


Re: [PATCH net] tcp: update skb->skb_mstamp more carefully

2017-09-14 Thread Soheil Hassas Yeganeh
On Thu, Sep 14, 2017 at 12:32 PM, Yuchung Cheng <ych...@google.com> wrote:
> On Thu, Sep 14, 2017 at 6:57 AM, Neal Cardwell <ncardw...@google.com> wrote:
>> On Wed, Sep 13, 2017 at 11:30 PM, Eric Dumazet <eric.duma...@gmail.com> 
>> wrote:
>>>
>>> From: Eric Dumazet <eduma...@googl.com>
>>>
>>> liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
>>> in tcp_probe_timer() :
>>>   https://www.spinics.net/lists/netdev/msg454496.html
>>>
>>> After investigations, the root cause of the problem is that we update
>>> skb->skb_mstamp of skbs in write queue, even if the attempt to send a
>>> clone or copy of it failed. One reason being a routing problem.
>>>
>>> This patch prevents this, solving liujian issue.
>>>
>>> It also removes a potential RTT miscalculation, since
>>> __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
>>> TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
>>> been changed.
>>>
>>> A future ACK would then lead to a very small RTT sample and min_rtt
>>> would then be lowered to this too small value.
>> ...
>>>
>>> Signed-off-by: Eric Dumazet <eduma...@googl.com>
>>> Reported-by: liujian <liujia...@huawei.com>
>>> ---
>>>  net/ipv4/tcp_output.c |   19 ---
>>>  1 file changed, 12 insertions(+), 7 deletions(-)
>>
>> Acked-by: Neal Cardwell <ncardw...@google.com>
> Acked-by: Yuchung Cheng <ych...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
Very nice! Thank you, Eric!


Re: [PATCH iproute2] tc: fq: support low_rate_threshold attribute

2017-09-08 Thread Soheil Hassas Yeganeh
On Fri, Sep 8, 2017 at 5:12 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> TCA_FQ_LOW_RATE_THRESHOLD sch_fq attribute was added in linux-4.9
>
> Tested:
>
> lpaa5:/tmp# tc -qd add dev eth1 root fq
> lpaa5:/tmp# tc -s qd sh dev eth1
> qdisc fq 8003: root refcnt 5 limit 1p flow_limit 1000p buckets 4096 \
>  orphan_mask 4095 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 quantum 
> 3648 \
>  initial_quantum 18240 low_rate_threshold 550Kbit refill_delay 40.0ms
>  Sent 62139 bytes 395 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>   116 flows (114 inactive, 0 throttled)
>   1 gc, 0 highprio, 0 throttled
>
> lpaa5:/tmp# ./netperf -H lpaa6 -t TCP_RR -l10 -- -q 50 -r 300,300 -o 
> P99_LATENCY
> 99th Percentile Latency Microseconds
> 7081
>
> lpaa5:/tmp# tc qd replace dev eth1 root fq low_rate_threshold 10Mbit
> lpaa5:/tmp# ./netperf -H lpaa6 -t TCP_RR -l10 -- -q 50 -r 300,300 -o 
> P99_LATENCY
> 99th Percentile Latency Microseconds
> 858
>
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you, Eric!


Re: [PATCH net-next] skbuff: only inherit relevant tx_flags

2017-06-08 Thread Soheil Hassas Yeganeh
On Thu, Jun 8, 2017 at 11:35 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
>
> From: Willem de Bruijn <will...@google.com>
>
> When inheriting tx_flags from one skbuff to another, always apply a
> mask to avoid overwriting unrelated other bits in the field.
>
> The two SKBTX_SHARED_FRAG cases clears all other bits. In practice,
> tx_flags are zero at this point now. But this is fragile. Timestamp
> flags are set, for instance, if in tcp_gso_segment, after this clear
> in skb_segment.
>
> The SKBTX_ANY_TSTAMP mask in __skb_tstamp_tx ensures that new
> skbs do not accidentally inherit flags such as SKBTX_SHARED_FRAG.
>
> Signed-off-by: Willem de Bruijn <will...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>


[PATCH net] sock: reset sk_err when the error queue is empty

2017-06-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Prior to f5f99309fa74 (sock: do not set sk_err in
sock_dequeue_err_skb), sk_err was reset to the error of
the skb on the head of the error queue.

Applications, most notably ping, are relying on this
behavior to reset sk_err for ICMP packets.

Set sk_err to the ICMP error when there is an ICMP packet
at the head of the error queue.

Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
Reported-by: Cyril Hrubis <chru...@suse.cz>
Tested-by: Cyril Hrubis <chru...@suse.cz>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
---
 net/core/skbuff.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 346d3e85dfbc..b1be7c01efe2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3754,8 +3754,11 @@ struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
 
	spin_lock_irqsave(&q->lock, flags);
skb = __skb_dequeue(q);
-   if (skb && (skb_next = skb_peek(q)))
+   if (skb && (skb_next = skb_peek(q))) {
icmp_next = is_icmp_err_skb(skb_next);
+   if (icmp_next)
+   sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;
+   }
	spin_unlock_irqrestore(&q->lock, flags);
 
if (is_icmp_err_skb(skb) && !icmp_next)
-- 
2.13.0.506.g27d5fe0cd-goog



Re: commit f5f99309 (sock: do not set sk_err in sock_dequeue_err_skb) has broken ping

2017-06-01 Thread Soheil Hassas Yeganeh
On Thu, Jun 1, 2017 at 11:36 AM, Cyril Hrubis  wrote:
> It seems to repeatedly produce (until I plug the cable back):
>
> ee_errno = 113 ee_origin = 2 ee_type = 3 ee_code = 1 ee_info = 0 ee_data = 0
>
> So we get EHOSTUNREACH on SO_EE_ORIGIN_ICMP.

Thank you very much! I have a wild guess that, when we
have a train of skbs on the error queue starting from a local error,
we will see this issue.

Ping (without my patch) considers EAGAIN on a normal read as an
indication that there is nothing on the error queue, but that's a
flawed assumption.

Would you mind trying another shot in the darkness please? Thanks!

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5a726161f4e4..097152a03c74 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3742,7 +3742,8 @@ EXPORT_SYMBOL(sock_queue_err_skb);
 static bool is_icmp_err_skb(const struct sk_buff *skb)
 {
return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
-  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
+  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6 ||
+  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_LOCAL);
 }

 struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
@@ -3751,14 +3752,19 @@ struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
struct sk_buff *skb, *skb_next = NULL;
bool icmp_next = false;
unsigned long flags;
+   int err = 0;

	spin_lock_irqsave(&q->lock, flags);
skb = __skb_dequeue(q);
-   if (skb && (skb_next = skb_peek(q)))
+   if (skb && (skb_next = skb_peek(q))) {
icmp_next = is_icmp_err_skb(skb_next);
+   err = SKB_EXT_ERR(skb_next)->ee.ee_origin;
+   }
	spin_unlock_irqrestore(&q->lock, flags);

-   if (is_icmp_err_skb(skb) && !icmp_next)
+   if (icmp_next)
+   sk->sk_err = err;
+   else if (is_icmp_err_skb(skb) && !icmp_next)
sk->sk_err = 0;

if (skb_next)


Re: commit f5f99309 (sock: do not set sk_err in sock_dequeue_err_skb) has broken ping

2017-06-01 Thread Soheil Hassas Yeganeh
On Thu, Jun 1, 2017 at 11:10 AM, Cyril Hrubis  wrote:
>> Thank you for the confirmation. Could you please try the following
>> patch to see if it fixes your issue?
>
> Does not seem to help, I still got the same bussy loop.

Thank you for trying the patch. Unfortunately, I can't reproduce on my
machines here.Would you humor me with another one? Thank you!

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5a726161f4e4..49207298fcea 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3742,7 +3742,8 @@ EXPORT_SYMBOL(sock_queue_err_skb);
 static bool is_icmp_err_skb(const struct sk_buff *skb)
 {
return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
-  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
+  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6 ||
+  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_LOCAL);
 }

 struct sk_buff *sock_dequeue_err_skb(struct sock *sk)


Re: commit f5f99309 (sock: do not set sk_err in sock_dequeue_err_skb) has broken ping

2017-06-01 Thread Soheil Hassas Yeganeh
On Thu, Jun 1, 2017 at 10:31 AM, Cyril Hrubis <chru...@suse.cz> wrote:
> I've started bisecting on v4.11 and see the problem on v4.10 on another
> machine, the patch should be there in both cases and the bug is easily
> reproducible.

Thank you for the confirmation. Could you please try the following
patch to see if it fixes your issue?

>From 3ec438460425d127741b20f03f78644c9e441e8c Mon Sep 17 00:00:00 2001
From: Soheil Hassas Yeganeh <soh...@google.com>
Date: Thu, 1 Jun 2017 10:34:09 -0400
Subject: [PATCH net] sock: reset sk_err when the error queue is empty

Before f5f99309fa74 (sock: do not set sk_err in
sock_dequeue_err_skb), sk_err was reset to 0 upon reading
from the error queue when the error queue was empty.

Applications, most notably ping, are relying on this
behavior to reset sk_err.

Reset sk_err when there is no packet left on the
error queue.

Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
Reported-by: Cyril Hrubis <chru...@suse.cz>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 346d3e85dfbc..5a726161f4e4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3758,7 +3758,7 @@ struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
  icmp_next = is_icmp_err_skb(skb_next);
 spin_unlock_irqrestore(&q->lock, flags);

- if (is_icmp_err_skb(skb) && !icmp_next)
+ if ((is_icmp_err_skb(skb) && !icmp_next) || !skb_next)
  sk->sk_err = 0;

  if (skb_next)
-- 
2.13.0.219.gdb65acc882-goog


>> 2. I've also have sent a fix to iputils on
>> https://github.com/iputils/iputils/pull/75. Would you be kind to try
>> that pull request as well?
>
> That fixed the problem, you can add:
>
> Tested-by: Cyril Hrubis <chru...@suse.cz>

Thank you for testing! Will do.


Re: commit f5f99309 (sock: do not set sk_err in sock_dequeue_err_skb) has broken ping

2017-06-01 Thread Soheil Hassas Yeganeh
On Thu, Jun 1, 2017 at 10:00 AM, Cyril Hrubis <chru...@suse.cz> wrote:
> I've bisected the problem to this commit:
>
> commit f5f99309fa7481f59a500f0d08f3379cd6424c1f (HEAD, refs/bisect/bad)
> Author: Soheil Hassas Yeganeh <soh...@google.com>
> Date:   Thu Nov 3 18:24:27 2016 -0400
>
> sock: do not set sk_err in sock_dequeue_err_skb

Hi Cyril,

I'm sorry for the problem, and thank you for the report.

Two questions:
1. Could you double check whether you have the following commit in your tree?

commit 83a1a1a70e87f676fbb6086b26b6ac7f7fdd107d
Author: Soheil Hassas Yeganeh <soh...@google.com>
Date:   Wed Nov 30 14:01:08 2016 -0500
sock: reset sk_err for ICMP packets read from error queue

2. I've also have sent a fix to iputils on
https://github.com/iputils/iputils/pull/75. Would you be kind to try
that pull request as well?

Thanks,
Soheil


Re: [PATCH net-next] tcp: fix TCP_SYNCNT flakes

2017-05-23 Thread Soheil Hassas Yeganeh
On Tue, May 23, 2017 at 3:38 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> After the mentioned commit, some of our packetdrill tests became flaky.
>
> TCP_SYNCNT socket option can limit the number of SYN retransmits.
>
> retransmits_timed_out() has to compare time computations based on
> local_clock() while timers are based on jiffies. With NTP adjustments
> and roundings we can observe 999 ms delay for 1000 ms timers.
> We end up sending one extra SYN packet.
>
> Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate
> TCP's connection close threshold as a time value") makes no
> real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at
> all.
>
> Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set
> parameter from retransmits_timed_out()
>
> Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Signed-off-by: Yuchung Cheng <ych...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Nice!


Re: [PATCH net-next] tcp: fix tcp_probe_timer() for TCP_USER_TIMEOUT

2017-05-21 Thread Soheil Hassas Yeganeh
On Sun, May 21, 2017 at 1:39 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> TCP_USER_TIMEOUT is still converted to jiffies value in
> icsk_user_timeout
>
> So we need to make a conversion for the cases HZ != 1000
>
> Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you for the fix, Eric!


Re: [PATCH net-next] tcp: fix tcp_rearm_rto()

2017-05-18 Thread Soheil Hassas Yeganeh
On Thu, May 18, 2017 at 12:15 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> skbs in (re)transmit queue no longer have a copy of jiffies
> at the time of the transmit : skb->skb_mstamp is now in usec unit,
> with no correlation to tcp_jiffies32.
>
> We have to convert rto from jiffies to usec, compute a time difference
> in usec, then convert the delta to HZ units.
>
> Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

Thank you for the quick fix, Eric!
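
A sketch of the unit dance the fix has to perform (jiffies_to_usecs()
and usecs_to_jiffies() are existing kernel helpers; now_us stands for
the current usec-based clock, and the exact expression in the patch
may differ):

	u64 rto_us = jiffies_to_usecs(inet_csk(sk)->icsk_rto);
	u64 elapsed_us = now_us - skb->skb_mstamp;	/* both usec-based */
	u32 delta_jiffies =
		usecs_to_jiffies(rto_us > elapsed_us ? rto_us - elapsed_us : 1);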


Re: [PATCH v3 net-next 4/7] net: add new control message for incoming HW-timestamped packets

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 8:44 AM, Miroslav Lichvar  wrote:
> Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message
> for incoming packets with hardware timestamps. It contains the index of
> the real interface which received the packet and the length of the
> packet at layer 2.
>
> The index is useful with bonding, bridges and other interfaces, where
> IP_PKTINFO doesn't allow applications to determine which PHC made the
> timestamp. With the L2 length (and link speed) it is possible to
> transpose preamble timestamps to trailer timestamps, which are used in
> the NTP protocol.
>
> While this information could be provided by two new socket options
> independently from timestamping, it doesn't look like they would be very
> useful. With this option any performance impact is limited to hardware
> timestamping.
>
> Use dev_get_by_napi_id() to get the device and its index. On kernels
> with disabled CONFIG_NET_RX_BUSY_POLL or drivers not using NAPI, a zero
> index will be returned in the control message.
>
> CC: Richard Cochran 
> CC: Willem de Bruijn 
> Signed-off-by: Miroslav Lichvar 
> ---
>  Documentation/networking/timestamping.txt |  9 +
>  include/uapi/asm-generic/socket.h |  2 ++
>  include/uapi/linux/net_tstamp.h   | 10 +-
>  net/socket.c  | 27 ++-
>  4 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/networking/timestamping.txt 
> b/Documentation/networking/timestamping.txt
> index 96f5069..600c6bf 100644
> --- a/Documentation/networking/timestamping.txt
> +++ b/Documentation/networking/timestamping.txt
> @@ -193,6 +193,15 @@ SOF_TIMESTAMPING_OPT_STATS:
>the transmit timestamps, such as how long a certain block of
>data was limited by peer's receiver window.
>
> +SOF_TIMESTAMPING_OPT_PKTINFO:
> +
> +  Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
> +  packets with hardware timestamps. The message contains struct
> +  scm_ts_pktinfo, which supplies the index of the real interface which
> +  received the packet and its length at layer 2. A valid (non-zero)
> +  interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is
> +  enabled and the driver is using NAPI.
> +
>  New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
>  disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
>  regardless of the setting of sysctl net.core.tstamp_allow_data.
> diff --git a/include/uapi/asm-generic/socket.h 
> b/include/uapi/asm-generic/socket.h
> index 2b48856..a5f6e81 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -100,4 +100,6 @@
>
>  #define SO_COOKIE  57
>
> +#define SCM_TIMESTAMPING_PKTINFO   58
> +
>  #endif /* __ASM_GENERIC_SOCKET_H */
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index 0749fb1..f2fb455 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -9,6 +9,7 @@
>  #ifndef _NET_TIMESTAMPING_H
>  #define _NET_TIMESTAMPING_H
>
> +#include 
>  #include/* for SO_TIMESTAMPING */
>
>  /* SO_TIMESTAMPING gets an integer bit field comprised of these values */
> @@ -26,8 +27,9 @@ enum {
> SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
> SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
> SOF_TIMESTAMPING_OPT_STATS = (1<<12),
> +   SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
>
> -   SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_STATS,
> +   SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_PKTINFO,
> SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
>  SOF_TIMESTAMPING_LAST
>  };
> @@ -130,4 +132,10 @@ enum hwtstamp_rx_filters {
> HWTSTAMP_FILTER_NTP_ALL,
>  };
>
> +/* SCM_TIMESTAMPING_PKTINFO control message */
> +struct scm_ts_pktinfo {
> +   __u32 if_index;
> +   __u32 pkt_length;
> +};
> +

Given that this structure can change in the future, it might be worth
considering using TLVs (e.g., netlink attributes), similar to what
tcp_get_timestamping_opt_stats() does. This would simplify
adding/removing fields to/from the control message.

Thanks,
Soheil
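
For illustration only, a minimal sketch of what a TLV-based layout
could look like, modeled on the nla_put_*() pattern used by
tcp_get_timestamping_opt_stats(); the attribute names below are
hypothetical, not part of the posted patch:

	enum {
		SCM_TS_PKTINFO_UNSPEC,
		SCM_TS_PKTINFO_IF_INDEX,	/* u32 */
		SCM_TS_PKTINFO_PKT_LENGTH,	/* u32 */
		__SCM_TS_PKTINFO_MAX,
	};

	static struct sk_buff *scm_ts_pktinfo_tlvs(u32 if_index, u32 pkt_len)
	{
		struct sk_buff *skb;

		skb = alloc_skb(2 * nla_total_size(sizeof(u32)), GFP_ATOMIC);
		if (!skb)
			return NULL;

		/* Cannot fail: the skb was sized for exactly these
		 * attributes. New fields can later be appended without
		 * breaking existing userspace parsers.
		 */
		nla_put_u32(skb, SCM_TS_PKTINFO_IF_INDEX, if_index);
		nla_put_u32(skb, SCM_TS_PKTINFO_PKT_LENGTH, pkt_len);
		return skb;
	}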


>  #endif /* _NET_TIMESTAMPING_H */
> diff --git a/net/socket.c b/net/socket.c
> index c2564eb..ee1f4ec 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -662,6 +662,27 @@ static bool skb_is_err_queue(const struct sk_buff *skb)
> return skb->pkt_type == PACKET_OUTGOING;
>  }
>
> +static void put_ts_pktinfo(struct msghdr *msg, struct sk_buff *skb)
> +{
> +   struct scm_ts_pktinfo ts_pktinfo;
> +   struct net_device *orig_dev;
> +   int ifindex = 0;
> +
> +   if (!skb_mac_header_was_set(skb))
> +   return;
> +
> +   rcu_read_lock();
> +   orig_dev = dev_get_by_napi_id(skb_napi_id(skb));
> +   if (orig_dev)
> +   ifindex 

Re: [PATCH net-next 15/15] tcp: switch TCP TS option (RFC 7323) to 1ms clock

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> TCP Timestamps option is defined in RFC 7323
>
> Traditionally on linux, it has been tied to the internal
> 'jiffies' variable, because it had been a cheap and good enough
> generator.
>
> For TCP flows on the Internet, 1 ms resolution would be much better
> than 4ms or 10ms (HZ=250 or HZ=100 respectively)
>
> For TCP flows in the DC, Google has used usec resolution for more
> than two years with great success [1]
>
> Receive size autotuning (DRS) is indeed more precise and converges
> faster to optimal window size.
>
> This patch converts tp->tcp_mstamp to a plain u64 value storing
> a 1 usec TCP clock.
>
> This choice will allow us to upstream the 1 usec TS option as
> discussed in IETF 97.
>
> [1] 
> https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  include/linux/skbuff.h   | 62 +-
>  include/linux/tcp.h  | 22 -
>  include/net/tcp.h| 59 
>  net/ipv4/syncookies.c|  8 ++--
>  net/ipv4/tcp.c   |  4 +-
>  net/ipv4/tcp_bbr.c   | 22 -
>  net/ipv4/tcp_input.c | 96 
> 
>  net/ipv4/tcp_ipv4.c  | 17 +++
>  net/ipv4/tcp_lp.c| 12 ++---
>  net/ipv4/tcp_minisocks.c |  4 +-
>  net/ipv4/tcp_output.c| 16 +++
>  net/ipv4/tcp_rate.c  | 16 +++
>  net/ipv4/tcp_recovery.c  | 23 +-
>  net/ipv4/tcp_timer.c |  8 ++--
>  net/ipv6/syncookies.c|  2 +-
>  net/ipv6/tcp_ipv6.c  |  4 +-
>  net/netfilter/nf_synproxy_core.c |  2 +-
>  17 files changed, 178 insertions(+), 199 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 
> bfc7892f6c33c9fdfb7c0d8110f80cfb12d1ae61..7c0cb2ce8b01a9be366d8cdb7e3661f65ebff3c9
>  100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -506,66 +506,6 @@ typedef unsigned int sk_buff_data_t;
>  typedef unsigned char *sk_buff_data_t;
>  #endif
>
> -/**
> - * struct skb_mstamp - multi resolution time stamps
> - * @stamp_us: timestamp in us resolution
> - * @stamp_jiffies: timestamp in jiffies
> - */
> -struct skb_mstamp {
> -   union {
> -   u64 v64;
> -   struct {
> -   u32 stamp_us;
> -   u32 stamp_jiffies;
> -   };
> -   };
> -};
> -
> -/**
> - * skb_mstamp_get - get current timestamp
> - * @cl: place to store timestamps
> - */
> -static inline void skb_mstamp_get(struct skb_mstamp *cl)
> -{
> -   u64 val = local_clock();
> -
> -   do_div(val, NSEC_PER_USEC);
> -   cl->stamp_us = (u32)val;
> -   cl->stamp_jiffies = (u32)jiffies;
> -}
> -
> -/**
> - * skb_mstamp_delta - compute the difference in usec between two skb_mstamp
> - * @t1: pointer to newest sample
> - * @t0: pointer to oldest sample
> - */
> -static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
> - const struct skb_mstamp *t0)
> -{
> -   s32 delta_us = t1->stamp_us - t0->stamp_us;
> -   u32 delta_jiffies = t1->stamp_jiffies - t0->stamp_jiffies;
> -
> -   /* If delta_us is negative, this might be because interval is too big,
> -* or local_clock() drift is too big : fallback using jiffies.
> -*/
> -   if (delta_us <= 0 ||
> -   delta_jiffies >= (INT_MAX / (USEC_PER_SEC / HZ)))
> -
> -   delta_us = jiffies_to_usecs(delta_jiffies);
> -
> -   return delta_us;
> -}
> -
> -static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
> -   const struct skb_mstamp *t0)
> -{
> -   s32 diff = t1->stamp_jiffies - t0->stamp_jiffies;
> -
> -   if (!diff)
> -   diff = t1->stamp_us - t0->stamp_us;
> -   return diff > 0;
> -}
> -
>  /**
>   * struct sk_buff - socket buffer
>   * @next: Next buffer in list
> @@ -646,7 +586,7 @@ struct sk_buff {
>
> union {
> ktime_t tstamp;
> -   struct skb_mstamp skb_mstamp;
> +   u64 skb_mstamp;
> };
> };
> struct rb_no
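
A sketch of the simplification this conversion buys (illustrative; the
remainder of the diff is not quoted above). With both stamps plain u64
microsecond counts, a delta needs no jiffies fallback or wraparound
heuristics:

	/* Assumed shape of a delta helper under the new scheme: */
	static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
	{
		/* Clamp clock glitches to zero instead of falling
		 * back to jiffies, as the removed
		 * skb_mstamp_us_delta() had to.
		 */
		return max_t(s64, t1 - t0, 0);
	}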

Re: [PATCH net-next 12/15] tcp_westwood: use tcp_jiffies32 instead of tcp_time_stamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> This CC does not need 1 ms tcp_time_stamp and can use
> the jiffy based 'timestamp'.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_westwood.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
> index 
> 9775453b8d174c848dc09df83d1fa185422cd8cc..bec9cafbe3f92938e5d79d743d629b2f33464418
>  100644
> --- a/net/ipv4/tcp_westwood.c
> +++ b/net/ipv4/tcp_westwood.c
> @@ -68,7 +68,7 @@ static void tcp_westwood_init(struct sock *sk)
> w->cumul_ack = 0;
> w->reset_rtt_min = 1;
> w->rtt_min = w->rtt = TCP_WESTWOOD_INIT_RTT;
> -   w->rtt_win_sx = tcp_time_stamp;
> +   w->rtt_win_sx = tcp_jiffies32;
> w->snd_una = tcp_sk(sk)->snd_una;
> w->first_ack = 1;
>  }
> @@ -116,7 +116,7 @@ static void tcp_westwood_pkts_acked(struct sock *sk,
>  static void westwood_update_window(struct sock *sk)
>  {
> struct westwood *w = inet_csk_ca(sk);
> -   s32 delta = tcp_time_stamp - w->rtt_win_sx;
> +   s32 delta = tcp_jiffies32 - w->rtt_win_sx;
>
> /* Initialize w->snd_una with the first acked sequence number in order
>  * to fix mismatch between tp->snd_una and w->snd_una for the first
> @@ -140,7 +140,7 @@ static void westwood_update_window(struct sock *sk)
> westwood_filter(w, delta);
>
> w->bk = 0;
> -   w->rtt_win_sx = tcp_time_stamp;
> +   w->rtt_win_sx = tcp_jiffies32;
> }
>  }
>
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 11/15] tcp: use tcp_jiffies32 in __tcp_oow_rate_limited()

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> This place wants to use tcp_jiffies32, this is good enough.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> eeb4967df25a8dc35128d0a0848b5ae7ee6d63e3..85575888365a10643e096f9e019adaa3eda87d40
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3390,7 +3390,7 @@ static bool __tcp_oow_rate_limited(struct net *net, int 
> mib_idx,
>u32 *last_oow_ack_time)
>  {
> if (*last_oow_ack_time) {
> -   s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
> +   s32 elapsed = (s32)(tcp_jiffies32 - *last_oow_ack_time);
>
> if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
> NET_INC_STATS(net, mib_idx);
> @@ -3398,7 +3398,7 @@ static bool __tcp_oow_rate_limited(struct net *net, int 
> mib_idx,
> }
> }
>
> -   *last_oow_ack_time = tcp_time_stamp;
> +   *last_oow_ack_time = tcp_jiffies32;
>
> return false;   /* not rate-limited: go ahead, send dupack now! */
>  }
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 14/15] tcp: replace misc tcp_time_stamp to tcp_jiffies32

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> After this patch, all uses of tcp_time_stamp will require
> a change when we introduce 1 ms and/or 1 us TCP TS option.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp.c   | 2 +-
>  net/ipv4/tcp_htcp.c  | 2 +-
>  net/ipv4/tcp_input.c | 2 +-
>  net/ipv4/tcp_minisocks.c | 2 +-
>  net/ipv4/tcp_output.c| 4 ++--
>  5 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> b85bfe7cb11dca68952cc4be19b169d893963fef..85005480052626c5769ef100a868c88fad803f75
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -386,7 +386,7 @@ void tcp_init_sock(struct sock *sk)
>
> icsk->icsk_rto = TCP_TIMEOUT_INIT;
> tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> -   minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
> +   minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
>
> /* So many TCP implementations out there (incorrectly) count the
>  * initial SYN frame in their delayed-ACK and congestion control
> diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
> index 
> 4a4d8e76738fa2831dcc3ecec5924dd3dfb7bf58..3eb78cde6ff0a22b7b411f0ae4258b6ef74ffe73
>  100644
> --- a/net/ipv4/tcp_htcp.c
> +++ b/net/ipv4/tcp_htcp.c
> @@ -104,7 +104,7 @@ static void measure_achieved_throughput(struct sock *sk,
> const struct inet_connection_sock *icsk = inet_csk(sk);
> const struct tcp_sock *tp = tcp_sk(sk);
> struct htcp *ca = inet_csk_ca(sk);
> -   u32 now = tcp_time_stamp;
> +   u32 now = tcp_jiffies32;
>
> if (icsk->icsk_ca_state == TCP_CA_Open)
> ca->pkts_acked = sample->pkts_acked;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> 85575888365a10643e096f9e019adaa3eda87d40..10e6775464f647a65ea0d19c10b421f9cd38923d
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2911,7 +2911,7 @@ static void tcp_update_rtt_min(struct sock *sk, u32 
> rtt_us)
> struct tcp_sock *tp = tcp_sk(sk);
> u32 wlen = sysctl_tcp_min_rtt_wlen * HZ;
>
> -   minmax_running_min(&tp->rtt_min, wlen, tcp_time_stamp,
> +   minmax_running_min(&tp->rtt_min, wlen, tcp_jiffies32,
>rtt_us ? : jiffies_to_usecs(1));
>  }
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 
> 59c32e0086c0e46d7955dffe211ec03bb18dcb12..6504f1082bdfda77bfc1b53d0d85928e5083a24e
>  100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -445,7 +445,7 @@ struct sock *tcp_create_openreq_child(const struct sock 
> *sk,
>
> newtp->srtt_us = 0;
> newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> -   minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
> +   minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
> newicsk->icsk_rto = TCP_TIMEOUT_INIT;
> newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> 1011ea40c2ba4c12cce21149cab176e1fa4db583..65472e931a0b79f7078a4da7db802dfcc32c7621
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2418,10 +2418,10 @@ bool tcp_schedule_loss_probe(struct sock *sk)
> timeout = max_t(u32, timeout, msecs_to_jiffies(10));
>
> /* If RTO is shorter, just schedule TLP in its place. */
> -   tlp_time_stamp = tcp_time_stamp + timeout;
> +   tlp_time_stamp = tcp_jiffies32 + timeout;
> rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
> if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
> -   s32 delta = rto_time_stamp - tcp_time_stamp;
> +   s32 delta = rto_time_stamp - tcp_jiffies32;
> if (delta > 0)
> timeout = delta;
> }
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 13/15] tcp_lp: cache tcp_time_stamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> tcp_time_stamp will become slightly more expensive soon,
> cache its value.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_lp.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_lp.c b/net/ipv4/tcp_lp.c
> index 
> d6fb6c067af4641f232b94e7c590c212648e8173..ef3122abb3734a63011fba035f7a7aae431da8de
>  100644
> --- a/net/ipv4/tcp_lp.c
> +++ b/net/ipv4/tcp_lp.c
> @@ -264,18 +264,19 @@ static void tcp_lp_pkts_acked(struct sock *sk, const 
> struct ack_sample *sample)
>  {
> struct tcp_sock *tp = tcp_sk(sk);
> struct lp *lp = inet_csk_ca(sk);
> +   u32 now = tcp_time_stamp;
> u32 delta;
>
> if (sample->rtt_us > 0)
> tcp_lp_rtt_sample(sk, sample->rtt_us);
>
> /* calc inference */
> -   delta = tcp_time_stamp - tp->rx_opt.rcv_tsecr;
> +   delta = now - tp->rx_opt.rcv_tsecr;
> if ((s32)delta > 0)
> lp->inference = 3 * delta;
>
> /* test if within inference */
> -   if (lp->last_drop && (tcp_time_stamp - lp->last_drop < lp->inference))
> +   if (lp->last_drop && (now - lp->last_drop < lp->inference))
> lp->flag |= LP_WITHIN_INF;
> else
> lp->flag &= ~LP_WITHIN_INF;
> @@ -312,7 +313,7 @@ static void tcp_lp_pkts_acked(struct sock *sk, const 
> struct ack_sample *sample)
> tp->snd_cwnd = max(tp->snd_cwnd >> 1U, 1U);
>
> /* record this drop time */
> -   lp->last_drop = tcp_time_stamp;
> +   lp->last_drop = now;
>  }
>
>  static struct tcp_congestion_ops tcp_lp __read_mostly = {
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 09/15] tcp: use tcp_jiffies32 to feed probe_timestamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use tcp_jiffies32 instead of tcp_time_stamp, since
> tcp_time_stamp will soon be only used for TCP TS option.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_output.c | 6 +++---
>  net/ipv4/tcp_timer.c  | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> cbda5de164495cf318960489bd8edf98fe3a5033..f0fd1b4fdb3291638fcdca613d826db2cd27f517
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1475,7 +1475,7 @@ void tcp_mtup_init(struct sock *sk)
> icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, 
> net->ipv4.sysctl_tcp_base_mss);
> icsk->icsk_mtup.probe_size = 0;
> if (icsk->icsk_mtup.enabled)
> -   icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
> +   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
>  }
>  EXPORT_SYMBOL(tcp_mtup_init);
>
> @@ -1987,7 +1987,7 @@ static inline void tcp_mtu_check_reprobe(struct sock 
> *sk)
> s32 delta;
>
> interval = net->ipv4.sysctl_tcp_probe_interval;
> -   delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
> +   delta = tcp_jiffies32 - icsk->icsk_mtup.probe_timestamp;
> if (unlikely(delta >= interval * HZ)) {
> int mss = tcp_current_mss(sk);
>
> @@ -1999,7 +1999,7 @@ static inline void tcp_mtu_check_reprobe(struct sock 
> *sk)
> icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>
> /* Update probe time stamp */
> -   icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
> +   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
> }
>  }
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 
> 9e0616cb8c17a6385ac97fc0cd657ef9413a1749..6629f47aa7f0182ece7873afcc3daa6f0019e228
>  100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -115,7 +115,7 @@ static void tcp_mtu_probing(struct inet_connection_sock 
> *icsk, struct sock *sk)
> if (net->ipv4.sysctl_tcp_mtu_probing) {
> if (!icsk->icsk_mtup.enabled) {
> icsk->icsk_mtup.enabled = 1;
> -   icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
> +   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
> tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
> } else {
> struct net *net = sock_net(sk);
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 10/15] tcp: uses jiffies_32 to feed tp->chrono_start

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> tcp_time_stamp will no longer be tied to jiffies.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp.c| 2 +-
>  net/ipv4/tcp_output.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> d0bb61ee28bbceff8f2e27416ce87fec94935973..b85bfe7cb11dca68952cc4be19b169d893963fef
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2757,7 +2757,7 @@ static void tcp_get_info_chrono_stats(const struct 
> tcp_sock *tp,
> for (i = TCP_CHRONO_BUSY; i < __TCP_CHRONO_MAX; ++i) {
> stats[i] = tp->chrono_stat[i - 1];
> if (i == tp->chrono_type)
> -   stats[i] += tcp_time_stamp - tp->chrono_start;
> +   stats[i] += tcp_jiffies32 - tp->chrono_start;
> stats[i] *= USEC_PER_SEC / HZ;
> total += stats[i];
> }
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> f0fd1b4fdb3291638fcdca613d826db2cd27f517..1011ea40c2ba4c12cce21149cab176e1fa4db583
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2202,7 +2202,7 @@ static bool tcp_small_queue_check(struct sock *sk, 
> const struct sk_buff *skb,
>
>  static void tcp_chrono_set(struct tcp_sock *tp, const enum tcp_chrono new)
>  {
> -   const u32 now = tcp_time_stamp;
> +   const u32 now = tcp_jiffies32;
>
> if (tp->chrono_type > TCP_CHRONO_UNSPEC)
> tp->chrono_stat[tp->chrono_type - 1] += now - 
> tp->chrono_start;
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 08/15] tcp: use tcp_jiffies32 for rcv_tstamp and lrcvtime

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use tcp_jiffies32 instead of tcp_time_stamp, since
> tcp_time_stamp will soon be only used for TCP TS option.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  include/net/tcp.h| 4 ++--
>  net/ipv4/tcp_input.c | 6 +++---
>  net/ipv4/tcp_minisocks.c | 2 +-
>  net/ipv4/tcp_output.c| 2 +-
>  net/ipv4/tcp_timer.c | 2 +-
>  5 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 
> feba4c0406e551d7e57da3411476735731b4d817..5b2932b8363fb8546322ebff7c74663139b3371d
>  100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1307,8 +1307,8 @@ static inline u32 keepalive_time_elapsed(const struct 
> tcp_sock *tp)
>  {
> const struct inet_connection_sock *icsk = &tp->inet_conn;
>
> -   return min_t(u32, tcp_time_stamp - icsk->icsk_ack.lrcvtime,
> - tcp_time_stamp - tp->rcv_tstamp);
> +   return min_t(u32, tcp_jiffies32 - icsk->icsk_ack.lrcvtime,
> + tcp_jiffies32 - tp->rcv_tstamp);
>  }
>
>  static inline int tcp_fin_time(const struct sock *sk)
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> 6a15c9b80b09829799dc37d89ecdbf11ec9ff904..eeb4967df25a8dc35128d0a0848b5ae7ee6d63e3
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -672,7 +672,7 @@ static void tcp_event_data_recv(struct sock *sk, struct 
> sk_buff *skb)
>
> tcp_rcv_rtt_measure(tp);
>
> -   now = tcp_time_stamp;
> +   now = tcp_jiffies32;
>
> if (!icsk->icsk_ack.ato) {
> /* The _first_ data packet received, initialize
> @@ -3636,7 +3636,7 @@ static int tcp_ack(struct sock *sk, const struct 
> sk_buff *skb, int flag)
>  */
> sk->sk_err_soft = 0;
> icsk->icsk_probes_out = 0;
> -   tp->rcv_tstamp = tcp_time_stamp;
> +   tp->rcv_tstamp = tcp_jiffies32;
> if (!prior_packets)
> goto no_queue;
>
> @@ -5554,7 +5554,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff 
> *skb)
> struct inet_connection_sock *icsk = inet_csk(sk);
>
> tcp_set_state(sk, TCP_ESTABLISHED);
> -   icsk->icsk_ack.lrcvtime = tcp_time_stamp;
> +   icsk->icsk_ack.lrcvtime = tcp_jiffies32;
>
> if (skb) {
> icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 
> 717be4de53248352c758b50557987d898340dd4f..59c32e0086c0e46d7955dffe211ec03bb18dcb12
>  100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -447,7 +447,7 @@ struct sock *tcp_create_openreq_child(const struct sock 
> *sk,
> newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
> newicsk->icsk_rto = TCP_TIMEOUT_INIT;
> -   newicsk->icsk_ack.lrcvtime = tcp_time_stamp;
> +   newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
>
> newtp->packets_out = 0;
> newtp->retrans_out = 0;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> 4bd50f0b236ba23fe521a76dd9d35ee16acb061f..cbda5de164495cf318960489bd8edf98fe3a5033
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3324,7 +3324,7 @@ static void tcp_connect_init(struct sock *sk)
> if (likely(!tp->repair))
> tp->rcv_nxt = 0;
> else
> -   tp->rcv_tstamp = tcp_time_stamp;
> +   tp->rcv_tstamp = tcp_jiffies32;
> tp->rcv_wup = tp->rcv_nxt;
> tp->copied_seq = tp->rcv_nxt;
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 
> 5f6f219a431e41a90b3c5d667a1a22b50f4464cf..9e0616cb8c17a6385ac97fc0cd657ef9413a1749
>  100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -451,7 +451,7 @@ void tcp_retransmit_timer(struct sock *sk)
> tp->snd_una, tp->snd_nxt);
> }
>  #endif
> -   if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
> +   if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> tcp_write_err(sk);
> goto out;
> }
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 07/15] tcp: bic,cubic: use tcp_jiffies32 instead of tcp_time_stamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use tcp_jiffies32 instead of tcp_time_stamp, since
> tcp_time_stamp will soon be only used for TCP TS option.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_bic.c   |  6 +++---
>  net/ipv4/tcp_cubic.c | 12 ++--
>  2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
> index 
> 36087bca9f489646c2ca5aae3111449a956dd33b..609965f0e29836ed95605a2c7f3170e67c641058
>  100644
> --- a/net/ipv4/tcp_bic.c
> +++ b/net/ipv4/tcp_bic.c
> @@ -84,14 +84,14 @@ static void bictcp_init(struct sock *sk)
>  static inline void bictcp_update(struct bictcp *ca, u32 cwnd)
>  {
> if (ca->last_cwnd == cwnd &&
> -   (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
> +   (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> return;
>
> ca->last_cwnd = cwnd;
> -   ca->last_time = tcp_time_stamp;
> +   ca->last_time = tcp_jiffies32;
>
> if (ca->epoch_start == 0) /* record the beginning of an epoch */
> -   ca->epoch_start = tcp_time_stamp;
> +   ca->epoch_start = tcp_jiffies32;
>
> /* start off normal */
> if (cwnd <= low_window) {
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 
> 2052ca740916d0872a41125ab61b769b334a314b..57ae5b5ae643efad106f5d6ac224ca54a52f9689
>  100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -231,21 +231,21 @@ static inline void bictcp_update(struct bictcp *ca, u32 
> cwnd, u32 acked)
> ca->ack_cnt += acked;   /* count the number of ACKed packets */
>
> if (ca->last_cwnd == cwnd &&
> -   (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
> +   (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> return;
>
> /* The CUBIC function can update ca->cnt at most once per jiffy.
>  * On all cwnd reduction events, ca->epoch_start is set to 0,
>  * which will force a recalculation of ca->cnt.
>  */
> -   if (ca->epoch_start && tcp_time_stamp == ca->last_time)
> +   if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
> goto tcp_friendliness;
>
> ca->last_cwnd = cwnd;
> -   ca->last_time = tcp_time_stamp;
> +   ca->last_time = tcp_jiffies32;
>
> if (ca->epoch_start == 0) {
> -   ca->epoch_start = tcp_time_stamp;   /* record beginning */
> +   ca->epoch_start = tcp_jiffies32;/* record beginning */
> ca->ack_cnt = acked;/* start counting */
> ca->tcp_cwnd = cwnd;/* syn with cubic */
>
> @@ -276,7 +276,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 
> cwnd, u32 acked)
>  * if the cwnd < 1 million packets !!!
>  */
>
> -   t = (s32)(tcp_time_stamp - ca->epoch_start);
> +   t = (s32)(tcp_jiffies32 - ca->epoch_start);
> t += msecs_to_jiffies(ca->delay_min >> 3);
> /* change the unit from HZ to bictcp_HZ */
> t <<= BICTCP_HZ;
> @@ -448,7 +448,7 @@ static void bictcp_acked(struct sock *sk, const struct 
> ack_sample *sample)
> return;
>
> /* Discard delay samples right after fast recovery */
> -   if (ca->epoch_start && (s32)(tcp_time_stamp - ca->epoch_start) < HZ)
> +   if (ca->epoch_start && (s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
> return;
>
> delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 06/15] tcp_bbr: use tcp_jiffies32 instead of tcp_time_stamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use tcp_jiffies32 instead of tcp_time_stamp, since
> tcp_time_stamp will soon be only used for TCP TS option.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_bbr.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
> index 
> 92b045c72163def1c1d6aa0f2002760186aa5dc3..40dc4fc5f6acba91634290e1cacde69a3584248f
>  100644
> --- a/net/ipv4/tcp_bbr.c
> +++ b/net/ipv4/tcp_bbr.c
> @@ -730,12 +730,12 @@ static void bbr_update_min_rtt(struct sock *sk, const 
> struct rate_sample *rs)
> bool filter_expired;
>
> /* Track min RTT seen in the min_rtt_win_sec filter window: */
> -   filter_expired = after(tcp_time_stamp,
> +   filter_expired = after(tcp_jiffies32,
>bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
> if (rs->rtt_us >= 0 &&
> (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
> bbr->min_rtt_us = rs->rtt_us;
> -   bbr->min_rtt_stamp = tcp_time_stamp;
> +   bbr->min_rtt_stamp = tcp_jiffies32;
> }
>
> if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
> @@ -754,7 +754,7 @@ static void bbr_update_min_rtt(struct sock *sk, const 
> struct rate_sample *rs)
> /* Maintain min packets in flight for max(200 ms, 1 round). */
> if (!bbr->probe_rtt_done_stamp &&
> tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
> -   bbr->probe_rtt_done_stamp = tcp_time_stamp +
> +   bbr->probe_rtt_done_stamp = tcp_jiffies32 +
> msecs_to_jiffies(bbr_probe_rtt_mode_ms);
> bbr->probe_rtt_round_done = 0;
> bbr->next_rtt_delivered = tp->delivered;
> @@ -762,8 +762,8 @@ static void bbr_update_min_rtt(struct sock *sk, const 
> struct rate_sample *rs)
> if (bbr->round_start)
> bbr->probe_rtt_round_done = 1;
> if (bbr->probe_rtt_round_done &&
> -   after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) 
> {
> -   bbr->min_rtt_stamp = tcp_time_stamp;
> +   after(tcp_jiffies32, bbr->probe_rtt_done_stamp)) {
> +   bbr->min_rtt_stamp = tcp_jiffies32;
> bbr->restore_cwnd = 1;  /* snap to prior_cwnd 
> */
> bbr_reset_mode(sk);
> }
> @@ -810,7 +810,7 @@ static void bbr_init(struct sock *sk)
> bbr->probe_rtt_done_stamp = 0;
> bbr->probe_rtt_round_done = 0;
> bbr->min_rtt_us = tcp_min_rtt(tp);
> -   bbr->min_rtt_stamp = tcp_time_stamp;
> +   bbr->min_rtt_stamp = tcp_jiffies32;
>
> minmax_reset(&bbr->bw, bbr->rtt_cnt, 0);  /* init max bw to 0 */
>
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 05/15] tcp: use tcp_jiffies32 to feed tp->snd_cwnd_stamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use tcp_jiffies32 instead of tcp_time_stamp to feed
> tp->snd_cwnd_stamp.
>
> tcp_time_stamp will soon be a little bit more expensive
> than simply reading 'jiffies'.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_input.c   | 14 +++---
>  net/ipv4/tcp_metrics.c |  2 +-
>  net/ipv4/tcp_output.c  |  8 
>  3 files changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> c0b3f909df394214785749704f2760171fe9d160..6a15c9b80b09829799dc37d89ecdbf11ec9ff904
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -463,7 +463,7 @@ void tcp_init_buffer_space(struct sock *sk)
> tp->window_clamp = max(2 * tp->advmss, maxwin - tp->advmss);
>
> tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
>  }
>
>  /* 5. Recalculate window clamp after socket hit its memory bounds. */
> @@ -1954,7 +1954,7 @@ void tcp_enter_loss(struct sock *sk)
> }
> tp->snd_cwnd   = 1;
> tp->snd_cwnd_cnt   = 0;
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
>
> tp->retrans_out = 0;
> tp->lost_out = 0;
> @@ -2383,7 +2383,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, 
> bool unmark_loss)
> tcp_ecn_withdraw_cwr(tp);
> }
> }
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
> tp->undo_marker = 0;
>  }
>
> @@ -2520,7 +2520,7 @@ static inline void tcp_end_cwnd_reduction(struct sock 
> *sk)
> if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
> (tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
> tp->snd_cwnd = tp->snd_ssthresh;
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
> }
> tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
>  }
> @@ -2590,7 +2590,7 @@ static void tcp_mtup_probe_success(struct sock *sk)
>tcp_mss_to_mtu(sk, tp->mss_cache) /
>icsk->icsk_mtup.probe_size;
> tp->snd_cwnd_cnt = 0;
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
> tp->snd_ssthresh = tcp_current_ssthresh(sk);
>
> icsk->icsk_mtup.search_low = icsk->icsk_mtup.probe_size;
> @@ -2976,7 +2976,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, 
> u32 acked)
> const struct inet_connection_sock *icsk = inet_csk(sk);
>
> icsk->icsk_ca_ops->cong_avoid(sk, ack, acked);
> -   tcp_sk(sk)->snd_cwnd_stamp = tcp_time_stamp;
> +   tcp_sk(sk)->snd_cwnd_stamp = tcp_jiffies32;
>  }
>
>  /* Restart timer after forward progress on connection.
> @@ -5019,7 +5019,7 @@ static void tcp_new_space(struct sock *sk)
>
> if (tcp_should_expand_sndbuf(sk)) {
> tcp_sndbuf_expand(sk);
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
> }
>
> sk->sk_write_space(sk);
> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
> index 
> 653bbd67e3a39b68d27d26d17571c00ce2854bfd..102b2c90bb807d3a88d31b59324baf72cf901cdf
>  100644
> --- a/net/ipv4/tcp_metrics.c
> +++ b/net/ipv4/tcp_metrics.c
> @@ -524,7 +524,7 @@ void tcp_init_metrics(struct sock *sk)
> tp->snd_cwnd = 1;
> else
> tp->snd_cwnd = tcp_init_cwnd(tp, dst);
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwnd_stamp = tcp_jiffies32;
>  }
>
>  bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst)
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> be9f8f483e21bdbb4d944fcdae8560f3ae11ee64..4bd50f0b236ba23fe521a76dd9d35ee16acb061f
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -151,7 +151,7 @@ void tcp_cwnd_restart(struct sock *sk, s32 delta)
> while ((delta -= inet_csk(sk)->icsk_rto) > 0 && cwnd > restart_cwnd)
> cwnd >>= 1;
> tp->snd_cwnd = max(cwnd, restart_cwnd);
> -   tp->snd_cwnd_stamp = tcp_time_stamp;
> +   tp->snd_cwn

Re: [PATCH net-next 04/15] tcp: use tcp_jiffies32 to feed tp->lsndtime

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use tcp_jiffies32 instead of tcp_time_stamp to feed
> tp->lsndtime.
>
> tcp_time_stamp will soon be a little bit more expensive
> than simply reading 'jiffies'.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  include/net/tcp.h | 2 +-
>  net/ipv4/tcp.c| 2 +-
>  net/ipv4/tcp_cubic.c  | 2 +-
>  net/ipv4/tcp_input.c  | 4 ++--
>  net/ipv4/tcp_output.c | 4 ++--
>  net/ipv4/tcp_timer.c  | 4 ++--
>  6 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 
> 4b45be5708215bae4551a5430b63ab2777baf447..feba4c0406e551d7e57da3411476735731b4d817
>  100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1245,7 +1245,7 @@ static inline void 
> tcp_slow_start_after_idle_check(struct sock *sk)
> if (!sysctl_tcp_slow_start_after_idle || tp->packets_out ||
> ca_ops->cong_control)
> return;
> -   delta = tcp_time_stamp - tp->lsndtime;
> +   delta = tcp_jiffies32 - tp->lsndtime;
> if (delta > inet_csk(sk)->icsk_rto)
> tcp_cwnd_restart(sk, delta);
>  }
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 
> 1e4c76d2b8278ba71d6cc2cf7ebfe483e241f76e..d0bb61ee28bbceff8f2e27416ce87fec94935973
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2841,7 +2841,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info 
> *info)
> info->tcpi_retrans = tp->retrans_out;
> info->tcpi_fackets = tp->fackets_out;
>
> -   now = tcp_time_stamp;
> +   now = tcp_jiffies32;
> info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
> info->tcpi_last_data_recv = jiffies_to_msecs(now - 
> icsk->icsk_ack.lrcvtime);
> info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 
> 0683ba447d775b6101a929a6aca3eb255cff8932..2052ca740916d0872a41125ab61b769b334a314b
>  100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -155,7 +155,7 @@ static void bictcp_cwnd_event(struct sock *sk, enum 
> tcp_ca_event event)
>  {
> if (event == CA_EVENT_TX_START) {
> struct bictcp *ca = inet_csk_ca(sk);
> -   u32 now = tcp_time_stamp;
> +   u32 now = tcp_jiffies32;
> s32 delta;
>
> delta = now - tcp_sk(sk)->lsndtime;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> 06e2dbc2b4a212a054fd88e57bb902c55a171b11..c0b3f909df394214785749704f2760171fe9d160
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5571,7 +5571,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff 
> *skb)
> /* Prevent spurious tcp_cwnd_restart() on first data
>  * packet.
>  */
> -   tp->lsndtime = tcp_time_stamp;
> +   tp->lsndtime = tcp_jiffies32;
>
> tcp_init_buffer_space(sk);
>
> @@ -6008,7 +6008,7 @@ int tcp_rcv_state_process(struct sock *sk, struct 
> sk_buff *skb)
> tcp_update_pacing_rate(sk);
>
> /* Prevent spurious tcp_cwnd_restart() on first data packet */
> -   tp->lsndtime = tcp_time_stamp;
> +   tp->lsndtime = tcp_jiffies32;
>
> tcp_initialize_rcv_mss(sk);
> tcp_fast_path_on(tp);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> 4c8a6eaba6b39a2aea061dd6857ed8df954c5ca2..be9f8f483e21bdbb4d944fcdae8560f3ae11ee64
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -160,7 +160,7 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
> struct sock *sk)
>  {
> struct inet_connection_sock *icsk = inet_csk(sk);
> -   const u32 now = tcp_time_stamp;
> +   const u32 now = tcp_jiffies32;
>
> if (tcp_packets_in_flight(tp) == 0)
> tcp_ca_event(sk, CA_EVENT_TX_START);
> @@ -1918,7 +1918,7 @@ static bool tcp_tso_should_defer(struct sock *sk, 
> struct sk_buff *skb,
> /* Avoid bursty behavior by allowing defer
>  * only if the last write was recent.
>  */
> -   if ((s32)(tcp_time_stamp - tp->lsndtime) > 0)
> +   if ((s32)(tcp_jiffies32 - tp->lsndtime) > 0)
> goto send_now;
>
> in_flight = tcp_packets_in_flight(tp);
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 
&

Re: [PATCH net-next 03/15] dccp: do not use tcp_time_stamp

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> Use our own macro instead of abusing tcp_time_stamp
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/dccp/ccids/ccid2.c | 8 
>  net/dccp/ccids/ccid2.h | 2 +-
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
> index 
> 5e3a7302f7747e4c4f3134eacab2f2c65b13402f..e1295d5f2c562e8785f59a0f5bd7064f471e85ab
>  100644
> --- a/net/dccp/ccids/ccid2.c
> +++ b/net/dccp/ccids/ccid2.c
> @@ -233,7 +233,7 @@ static void ccid2_hc_tx_packet_sent(struct sock *sk, 
> unsigned int len)
>  {
> struct dccp_sock *dp = dccp_sk(sk);
> struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
> -   const u32 now = ccid2_time_stamp;
> +   const u32 now = ccid2_jiffies32;
> struct ccid2_seq *next;
>
> /* slow-start after idle periods (RFC 2581, RFC 2861) */
> @@ -466,7 +466,7 @@ static void ccid2_new_ack(struct sock *sk, struct 
> ccid2_seq *seqp,
>  * The cleanest solution is to not use the ccid2s_sent field at all
>  * and instead use DCCP timestamps: requires changes in other places.
>  */
> -   ccid2_rtt_estimator(sk, ccid2_time_stamp - seqp->ccid2s_sent);
> +   ccid2_rtt_estimator(sk, ccid2_jiffies32 - seqp->ccid2s_sent);
>  }
>
>  static void ccid2_congestion_event(struct sock *sk, struct ccid2_seq *seqp)
> @@ -478,7 +478,7 @@ static void ccid2_congestion_event(struct sock *sk, 
> struct ccid2_seq *seqp)
> return;
> }
>
> -   hc->tx_last_cong = ccid2_time_stamp;
> +   hc->tx_last_cong = ccid2_jiffies32;
>
> hc->tx_cwnd  = hc->tx_cwnd / 2 ? : 1U;
> hc->tx_ssthresh  = max(hc->tx_cwnd, 2U);
> @@ -731,7 +731,7 @@ static int ccid2_hc_tx_init(struct ccid *ccid, struct 
> sock *sk)
>
> hc->tx_rto   = DCCP_TIMEOUT_INIT;
> hc->tx_rpdupack  = -1;
> -   hc->tx_last_cong = hc->tx_lsndtime = hc->tx_cwnd_stamp = 
> ccid2_time_stamp;
> +   hc->tx_last_cong = hc->tx_lsndtime = hc->tx_cwnd_stamp = 
> ccid2_jiffies32;
> hc->tx_cwnd_used = 0;
> setup_timer(>tx_rtotimer, ccid2_hc_tx_rto_expire,
> (unsigned long)sk);
> diff --git a/net/dccp/ccids/ccid2.h b/net/dccp/ccids/ccid2.h
> index 
> 18c97543e522a6b9a5c8a3c817d4b40224adde48..6e50ef2898fb9dd9080217cc167defea6a2e9021
>  100644
> --- a/net/dccp/ccids/ccid2.h
> +++ b/net/dccp/ccids/ccid2.h
> @@ -27,7 +27,7 @@
>   * CCID-2 timestamping faces the same issues as TCP timestamping.
>   * Hence we reuse/share as much of the code as possible.
>   */
> -#define ccid2_time_stamp   tcp_time_stamp
> +#define ccid2_jiffies32((u32)jiffies)
>
>  /* NUMDUPACK parameter from RFC 4341, p. 6 */
>  #define NUMDUPACK  3
> --
> 2.13.0.303.g4ebf302169-goog
>


Re: [PATCH net-next 02/15] tcp: introduce tcp_jiffies32

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> We abuse tcp_time_stamp for two different cases :
>
> 1) base to generate TCP Timestamp options (RFC 7323)
>
> 2) A 32bit version of jiffies since some TCP fields
>are 32bit wide to save memory.
>
> Since we want to eventually have a 1 ms TCP TS clock,
> regardless of the HZ value, we want to clean things up.
>
> tcp_jiffies32 is the truncated jiffies value,
> which will be used only in places where we want a 'host'
> timestamp.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  include/net/tcp.h | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 
> b4dc93dae98c2d175ccadce150083705d237555e..4b45be5708215bae4551a5430b63ab2777baf447
>  100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -700,11 +700,14 @@ u32 __tcp_select_window(struct sock *sk);
>
>  void tcp_send_window_probe(struct sock *sk);
>
> -/* TCP timestamps are only 32-bits, this causes a slight
> - * complication on 64-bit systems since we store a snapshot
> - * of jiffies in the buffer control blocks below.  We decided
> - * to use only the low 32-bits of jiffies and hide the ugly
> - * casts with the following macro.
> +/* TCP uses 32bit jiffies to save some space.
> + * Note that this is different from tcp_time_stamp, which
> + * historically has been the same until linux-4.13.
> + */
> +#define tcp_jiffies32 ((u32)jiffies)
> +
> +/* Generator for TCP TS option (RFC 7323)
> + * Currently tied to 'jiffies' but will soon be driven by 1 ms clock.
>   */
>  #define tcp_time_stamp ((__u32)(jiffies))
>
> --
> 2.13.0.303.g4ebf302169-goog
>
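
A note on why the (u32)jiffies truncation is safe in practice: the
call sites converted in this series compare stamps with signed 32-bit
arithmetic, which tolerates wraparound as long as the measured
interval stays below 2^31 jiffies (about 24.8 days at HZ=1000):

	u32 start = tcp_jiffies32;
	...
	/* Correct even if jiffies wrapped between the two reads;
	 * 'timeout' and handle_timeout() are illustrative names,
	 * with 'timeout' expressed in jiffies.
	 */
	if ((s32)(tcp_jiffies32 - start) > timeout)
		handle_timeout();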


Re: [PATCH net-next 01/15] tcp: use tp->tcp_mstamp in output path

2017-05-17 Thread Soheil Hassas Yeganeh
On Tue, May 16, 2017 at 5:00 PM, Eric Dumazet <eduma...@google.com> wrote:
> The idea is to later convert tp->tcp_mstamp to a full u64 counter
> with usec resolution, so that we can have a fine-grained
> TCP TS clock (RFC 7323), regardless of the HZ value.
>
> We try to refresh tp->tcp_mstamp only when necessary.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_ipv4.c |  1 +
>  net/ipv4/tcp_output.c   | 21 +++--
>  net/ipv4/tcp_recovery.c |  1 -
>  net/ipv4/tcp_timer.c|  3 ++-
>  4 files changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 
> 5ab2aac5ca191075383fc75214da816873bb222c..d8fe25db79f223e3fde85882effd2ac6ec15f8ca
>  100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -483,6 +483,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
> skb = tcp_write_queue_head(sk);
> BUG_ON(!skb);
>
> +   skb_mstamp_get(&tp->tcp_mstamp);
> remaining = icsk->icsk_rto -
> min(icsk->icsk_rto,
> tcp_time_stamp - tcp_skb_timestamp(skb));
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> a32172d69a03cbe76b45ec3094222f6c3a73e27d..4c8a6eaba6b39a2aea061dd6857ed8df954c5ca2
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -997,8 +997,8 @@ static int tcp_transmit_skb(struct sock *sk, struct 
> sk_buff *skb, int clone_it,
> BUG_ON(!skb || !tcp_skb_pcount(skb));
> tp = tcp_sk(sk);
>
> +   skb->skb_mstamp = tp->tcp_mstamp;
> if (clone_it) {
> -   skb_mstamp_get(>skb_mstamp);
> TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
> - tp->snd_una;
> tcp_rate_skb_sent(sk, skb);
> @@ -1906,7 +1906,6 @@ static bool tcp_tso_should_defer(struct sock *sk, 
> struct sk_buff *skb,
> const struct inet_connection_sock *icsk = inet_csk(sk);
> u32 age, send_win, cong_win, limit, in_flight;
> struct tcp_sock *tp = tcp_sk(sk);
> -   struct skb_mstamp now;
> struct sk_buff *head;
> int win_divisor;
>
> @@ -1962,8 +1961,8 @@ static bool tcp_tso_should_defer(struct sock *sk, 
> struct sk_buff *skb,
> }
>
> head = tcp_write_queue_head(sk);
> -   skb_mstamp_get(&now);
> -   age = skb_mstamp_us_delta(&now, &head->skb_mstamp);
> +
> +   age = skb_mstamp_us_delta(&tp->tcp_mstamp, &head->skb_mstamp);
> /* If next ACK is likely to come too late (half srtt), do not defer */
> if (age < (tp->srtt_us >> 4))
> goto send_now;
> @@ -2280,6 +2279,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
> int mss_now, int nonagle,
> }
>
> max_segs = tcp_tso_segs(sk, mss_now);
> +   skb_mstamp_get(&tp->tcp_mstamp);
> while ((skb = tcp_send_head(sk))) {
> unsigned int limit;
>
> @@ -2291,7 +2291,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
> int mss_now, int nonagle,
>
> if (unlikely(tp->repair) && tp->repair_queue == 
> TCP_SEND_QUEUE) {
> /* "skb_mstamp" is used as a start point for the 
> retransmit timer */
> -   skb_mstamp_get(&skb->skb_mstamp);
> +   skb->skb_mstamp = tp->tcp_mstamp;
> goto repair; /* Skip network transmission */
> }
>
> @@ -2879,7 +2879,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct 
> sk_buff *skb, int segs)
>  skb_headroom(skb) >= 0x)) {
> struct sk_buff *nskb;
>
> -   skb_mstamp_get(&skb->skb_mstamp);
> +   skb->skb_mstamp = tp->tcp_mstamp;
> nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
> err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
>  -ENOBUFS;
> @@ -3095,7 +3095,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
> priority)
> skb_reserve(skb, MAX_TCP_HEADER);
> tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk),
>  TCPHDR_ACK | TCPHDR_RST);
> -   skb_mstamp_get(&skb->skb_mstamp);
> +   skb_mstamp_get(&tcp_sk(sk)->tcp_mstamp);
> /* Send it off. */
> if (tcp_transmit_skb(sk, skb, 0, priority))
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
> @@ -3453,7 +3453,8 @@ int tcp_c

[PATCH net-next] tcp: warn on negative reordering values

2017-05-16 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Commit bafbb9c73241 ("tcp: eliminate negative reordering
in tcp_clean_rtx_queue") fixes an issue with negative
reordering metrics.

To be resilient to such errors, warn and return
when a negative metric is passed to tcp_update_reordering().

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f27dff64e59e..eb5eb87060a2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -886,6 +886,9 @@ static void tcp_update_reordering(struct sock *sk, const 
int metric,
struct tcp_sock *tp = tcp_sk(sk);
int mib_idx;
 
+   if (WARN_ON_ONCE(metric < 0))
+   return;
+
if (metric > tp->reordering) {
tp->reordering = min(sysctl_tcp_max_reordering, metric);
 
-- 
2.13.0.303.g4ebf302169-goog



Re: [PATCH net-next] tcp: internal implementation for pacing

2017-05-15 Thread Soheil Hassas Yeganeh
On Mon, May 15, 2017 at 11:43 PM, Eric Dumazet <eduma...@google.com> wrote:
> BBR congestion control depends on pacing, and pacing is
> currently handled by the sch_fq packet scheduler for performance
> reasons, and also because implementing pacing with FQ was convenient
> to truly avoid bursts.
>
> However there are many cases where this packet scheduler constraint
> is not practical.
> - Many linux hosts are not focusing on handling thousands of TCP
>   flows in the most efficient way.
> - Some routers use fq_codel or other AQM, but still would like
>   to use BBR for the few TCP flows they initiate/terminate.
>
> This patch implements an automatic fallback to internal pacing.
>
> Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
>
> If sch_fq happens to be in the egress path, pacing is delegated to
> the qdisc, otherwise pacing is done by TCP itself.
>
> One advantage of pacing from TCP stack is to get more precise rtt
> estimations, and less work done from TX completion, since TCP Small
> queue limits are not generally hit. Setups with single TX queue but
> many cpus might even benefit from this.
>
> Note that unlike sch_fq, we do not take into account header sizes.
> Taking care of these headers would add additional complexity for
> no practical difference in behavior.
>
> Some performance numbers using 800 TCP_STREAM flows rate limited to
> ~48 Mbit per second on 40Gbit NIC.
>
> If MQ+pfifo_fast is used on the NIC :
>
> $ sar -n DEV 1 5 | grep eth
> 14:48:44 eth0 725743.00 2932134.00  46776.76 4335184.68  0.00 
>  0.00  1.00
> 14:48:45 eth0 725349.00 2932112.00  46751.86 4335158.90  0.00 
>  0.00  0.00
> 14:48:46 eth0 725101.00 2931153.00  46735.07 4333748.63  0.00 
>  0.00  0.00
> 14:48:47 eth0 725099.00 2931161.00  46735.11 4333760.44  0.00 
>  0.00  1.00
> 14:48:48 eth0 725160.00 2931731.00  46738.88 4334606.07  0.00 
>  0.00  0.00
> Average: eth0 725290.40 2931658.20  46747.54 4334491.74  0.00 
>  0.00  0.40
> $ vmstat 1 5
> procs ---memory-- ---swap-- -io -system-- 
> --cpu-
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa 
> st
>  4  0  0 259825920  45644 27083240021 2  247   98  0  0 
> 100  0  0
>  4  0  0 259823744  45644 270835600 0 0 2400825 159843  0 
> 19 81  0  0
>  0  0  0 259824208  45644 270807200 0 0 2407351 159929  0 
> 19 81  0  0
>  1  0  0 259824592  45644 270812800 0 0 2405183 160386  0 
> 19 80  0  0
>  1  0  0 259824272  45644 270786800 032 2396361 158037  0 
> 19 81  0  0
>
> Now use MQ+FQ :
>
> lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
> lpaa23:~# tc qdisc replace dev eth0 root mq
>
> $ sar -n DEV 1 5 | grep eth
> 14:49:57 eth0 678614.00 2727930.00  43739.13 4033279.14  0.00 
>  0.00  0.00
> 14:49:58 eth0 677620.00 2723971.00  43674.69 4027429.62  0.00 
>  0.00  1.00
> 14:49:59 eth0 676396.00 2719050.00  43596.83 4020125.02  0.00 
>  0.00  0.00
> 14:50:00 eth0 675197.00 2714173.00  43518.62 4012938.90  0.00 
>  0.00  1.00
> 14:50:01 eth0 676388.00 2719063.00  43595.47 4020171.64  0.00 
>  0.00  0.00
> Average: eth0 676843.00 2720837.40  43624.95 4022788.86  0.00 
>  0.00  0.40
> $ vmstat 1 5
> procs ---memory-- ---swap-- -io -system-- 
> --cpu-
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa 
> st
>  2  0  0 259832240  46008 27109120021 2  223  192  0  1 
> 99  0  0
>  1  0  0 259832896  46008 271074400 0 0 1702206 198078  0 
> 17 82  0  0
>  0  0  0 259830272  46008 271059600 0 0 1696340 197756  1 
> 17 83  0  0
>  4  0  0 259829168  46024 27105840    0    16 0 1688472 197158  1 
> 17 82  0  0
>  3  0  0 259830224  46024 271040800 0 0 1692450 197212  0 
> 18 82  0  0
>
> As expected, number of interrupts per second is very different.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> Cc: Neal Cardwell <ncardw...@google.com>
> Cc: Yuchung Cheng <ych...@google.com>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Van Jacobson <v...@google.com>
> Cc: Jerry Chu <hk...@google.com>
> ---
>  include/linux/tcp.h   |  2 ++
>  include/net/sock.h|  8 +-
>  i
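
The diff above is truncated; as a sketch of the core mechanism
(illustrative, not quoted from the patch, and assuming a per-socket
hrtimer field named pacing_timer), internal pacing simply delays the
next transmit by packet length over sk_pacing_rate:

	static void pace_next_xmit(struct sock *sk, const struct sk_buff *skb)
	{
		u32 rate = sk->sk_pacing_rate;
		u64 len_ns;

		if (!rate || rate == ~0U)	/* pacing not requested */
			return;

		/* Delay = bytes / (bytes per second), in nanoseconds. */
		len_ns = (u64)skb->len * NSEC_PER_SEC;
		do_div(len_ns, rate);
		hrtimer_start(&tcp_sk(sk)->pacing_timer,
			      ktime_add_ns(ktime_get(), len_ns),
			      HRTIMER_MODE_ABS_PINNED);
	}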

[PATCH net] tcp: eliminate negative reordering in tcp_clean_rtx_queue

2017-05-15 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

tcp_ack() can call tcp_fragment(), which may decrease
tp->fackets_out when the MSS changes. When prior_fackets
is larger than tp->fackets_out, tcp_clean_rtx_queue() can
invoke tcp_update_reordering() with negative values. This
results in absurd tp->reordering values higher than
sysctl_tcp_max_reordering.

Note that tcp_update_reordering() indeed sets tp->reordering
to min(sysctl_tcp_max_reordering, metric), but because
the comparison is signed, a negative metric always wins.

Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes")
Reported-by: Rebecca Isaacs <risa...@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9739962bfb3f..f27dff64e59e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3190,7 +3190,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
prior_fackets,
int delta;
 
/* Non-retransmitted hole got filled? That's reordering 
*/
-   if (reord < prior_fackets)
+   if (reord < prior_fackets && reord <= tp->fackets_out)
tcp_update_reordering(sk, tp->fackets_out - 
reord, 0);
 
delta = tcp_is_fack(tp) ? pkts_acked :
-- 
2.13.0.rc2.291.g57267f2277-goog

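A quick standalone illustration of the signed-min failure mode the
patch guards against (userspace sketch, not kernel code):

	#include <stdio.h>

	#define min(a, b) ((a) < (b) ? (a) : (b))

	int main(void)
	{
		int sysctl_tcp_max_reordering = 300;
		int metric = -2;	/* negative metric from the bug */

		/* Signed comparison: the negative value always wins,
		 * so the "cap" stores an absurd reordering degree.
		 */
		printf("%d\n", min(sysctl_tcp_max_reordering, metric));
		return 0;	/* prints -2 */
	}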


Re: [PATCH] net/packet: fix missing net_device reference release

2017-05-12 Thread Soheil Hassas Yeganeh
On Fri, May 12, 2017 at 2:19 PM, Douglas Caetano dos Santos
<dougla...@taghos.com.br> wrote:
> When using a TX ring buffer, if an error occurs processing a control
> message (e.g. invalid message), the net_device reference is not
> released.
>
> Fixes: c14ac9451c348 ("sock: enable timestamping using control messages")
> Signed-off-by: Douglas Caetano dos Santos <dougla...@taghos.com.br>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/packet/af_packet.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index f4001763134d..e3eeed19cc7a 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2658,13 +2658,6 @@ static int tpacket_snd(struct packet_sock *po, struct 
> msghdr *msg)
> dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex);
> }
>
> -   sockc.tsflags = po->sk.sk_tsflags;
> -   if (msg->msg_controllen) {
> -   err = sock_cmsg_send(&po->sk, msg, &sockc);
> -   if (unlikely(err))
> -   goto out;
> -   }
> -
> err = -ENXIO;
> if (unlikely(dev == NULL))
> goto out;
> @@ -2672,6 +2665,13 @@ static int tpacket_snd(struct packet_sock *po, struct 
> msghdr *msg)
> if (unlikely(!(dev->flags & IFF_UP)))
> goto out_put;
>
> +   sockc.tsflags = po->sk.sk_tsflags;
> +   if (msg->msg_controllen) {
> +   err = sock_cmsg_send(&po->sk, msg, &sockc);
> +   if (unlikely(err))
> +   goto out_put;
> +   }
> +
> if (po->sk.sk_socket->type == SOCK_RAW)
> reserve = dev->hard_header_len;
> size_max = po->tx_ring.frame_size
> --
> 2.12.2
>

Thank you for the fix.
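
The pattern at work in this fix (a sketch distilled from the diff
above): once dev_get_by_index() has taken a device reference, every
later failure must unwind through the label that drops it, so the
acquisition order dictates the error-label order:

	dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex);
	err = -ENXIO;
	if (unlikely(!dev))
		goto out;		/* no reference taken yet */

	err = sock_cmsg_send(&po->sk, msg, &sockc);
	if (unlikely(err))
		goto out_put;		/* must release the reference */
	/* ... */
out_put:
	dev_put(dev);
out:
	return err;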


Re: [PATCH net-next] tcp: tcp_rack_reo_timeout() must update tp->tcp_mstamp

2017-04-27 Thread Soheil Hassas Yeganeh
On Thu, Apr 27, 2017 at 12:10 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> I wrongly assumed tp->tcp_mstamp was up to date at the time
> tcp_rack_reo_timeout() was called.
>
> That is not true, since we only update tp->tcp_mstamp when receiving
> a packet (as initially done in commit 69e996c58a35 ("tcp: add
> tp->tcp_mstamp field")).
>
> tcp_rack_reo_timeout() being called by a timer and not an incoming
> packet, we need to refresh tp->tcp_mstamp
>
> Fixes: 7c1c7308592f ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Neal Cardwell <ncardw...@google.com>
> Cc: Yuchung Cheng <ych...@google.com>
> ---
>  net/ipv4/tcp_recovery.c |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
> index cd72b3d3879e88181c8a4639f0334a24e4cda852..362b8c75bfab44cf87c2a01398a146a271bc1119 100644
> --- a/net/ipv4/tcp_recovery.c
> +++ b/net/ipv4/tcp_recovery.c
> @@ -166,6 +166,7 @@ void tcp_rack_reo_timeout(struct sock *sk)
> u32 timeout, prior_inflight;
>
> prior_inflight = tcp_packets_in_flight(tp);
> +   skb_mstamp_get(&tp->tcp_mstamp);
> tcp_rack_detect_loss(sk, &timeout);
> if (prior_inflight != tcp_packets_in_flight(tp)) {
> if (inet_csk(sk)->icsk_ca_state != TCP_CA_Recovery) {
>
>

Thanks for the fix!
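
For readers skimming the thread: the bug pattern here is generic -- a
cached clock that is refreshed only on packet arrival goes stale inside a
timer callback. A minimal user-space sketch of the pattern (struct conn
and the function names are illustrative, not kernel code):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct conn {
	struct timespec mstamp;	/* cached "now", refreshed per packet */
};

static void refresh_mstamp(struct conn *c)
{
	clock_gettime(CLOCK_MONOTONIC, &c->mstamp);
}

static void conn_packet_in(struct conn *c)
{
	refresh_mstamp(c);	/* the per-packet refresh */
}

static void conn_timer_fire(struct conn *c)
{
	/* No packet arrived, so the cached stamp is stale; refresh it
	 * before use, mirroring the skb_mstamp_get() added above.
	 */
	refresh_mstamp(c);
	printf("timer ran at %ld.%09ld\n",
	       (long)c->mstamp.tv_sec, c->mstamp.tv_nsec);
}

int main(void)
{
	struct conn c;

	conn_packet_in(&c);
	sleep(1);		/* a timer fires with no intervening packet */
	conn_timer_fire(&c);
	return 0;
}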


Re: [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field

2017-04-25 Thread Soheil Hassas Yeganeh
On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <eduma...@google.com> wrote:
> We want to use precise timestamps in TCP stack, but we do not
> want to call possibly expensive kernel time services too often.
>
> tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
>
> We will use it in the following patches, removing specific
> skb_mstamp_get() calls, and removing ack_time from
> struct tcp_sacktag_state.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  include/linux/tcp.h  | 1 +
>  net/ipv4/tcp_input.c | 3 +++
>  2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index cbe5b602a2d349fdeb1e878305f37b4da1e6cc86..99a22f44c32e1587a6bf4835b65c7a4314807aa8 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -240,6 +240,7 @@ struct tcp_sock {
> u32 tlp_high_seq;   /* snd_nxt at the time of TLP retransmit. */
>
>  /* RTT measurement */
> +   struct skb_mstamp tcp_mstamp; /* most recent packet received/sent */
> u32 srtt_us;/* smoothed round trip time << 3 in usecs */
> u32 mdev_us;/* medium deviation */
> u32 mdev_max_us;/* maximal mdev for the last rtt period */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5af2f04f885914491a7116c20056b3d2188d2d7d..bd18c65df4a9d9c2b66d8005f2cc4ff468140a73 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5362,6 +5362,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>  {
> struct tcp_sock *tp = tcp_sk(sk);
>
> +   skb_mstamp_get(&tp->tcp_mstamp);
> if (unlikely(!sk->sk_rx_dst))
> inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
> /*
> @@ -5922,6 +5923,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>
> case TCP_SYN_SENT:
> tp->rx_opt.saw_tstamp = 0;
> +   skb_mstamp_get(&tp->tcp_mstamp);
> queued = tcp_rcv_synsent_state_process(sk, skb, th);
> if (queued >= 0)
> return queued;
> @@ -5933,6 +5935,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> return 0;
> }
>
> +   skb_mstamp_get(&tp->tcp_mstamp);
> tp->rx_opt.saw_tstamp = 0;
> req = tp->fastopen_rsk;
> if (req) {
> --
> 2.13.0.rc0.306.g87b477812d-goog
>

Nice patchset. Thanks, Eric!


Re: [PATCH net-next 1/2] tcp: remove poll() flakes when receiving RST

2017-04-18 Thread Soheil Hassas Yeganeh
On Tue, Apr 18, 2017 at 12:45 PM, Eric Dumazet <eduma...@google.com> wrote:
> When a RST packet is processed, we send two wakeup events to interested
> polling users.
>
> First one by a sk->sk_error_report(sk) from tcp_reset(),
> followed by a sk->sk_state_change(sk) from tcp_done().
>
> Depending on machine load and luck, poll() can either return POLLERR,
> or POLLIN|POLLOUT|POLLERR|POLLHUP (this happens on 99 % of the cases)
>
> This is probably fine, but we can avoid the confusion by reordering
> things so that we have more TCP fields updated before the first wakeup.
>
> This might even allow us to remove some barriers we added in the past.
>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a5838858c362cd3270296001ceaae341e9e9bf01..37e2aa925f62395cfb48145cd3a76b6afebb64b1 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4008,10 +4008,10 @@ void tcp_reset(struct sock *sk)
> /* This barrier is coupled with smp_rmb() in tcp_poll() */
> smp_wmb();
>
> +   tcp_done(sk);
> +
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_error_report(sk);
> -
> -   tcp_done(sk);
>  }
>
>  /*
> --
> 2.12.2.762.g0e3151a226-goog
>

Thanks, Eric. Nice improvement!
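
To observe the two wakeups from user space, a small probe like the one
below can be pointed at a peer that resets the connection. This is a
sketch only: the address and port are placeholders, and whether you see
POLLERR alone or POLLIN|POLLOUT|POLLERR|POLLHUP depends on which wakeup
the poll() call races with.

#include <arpa/inet.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	struct pollfd pfd;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(8000);	/* placeholder peer that sends RST */
	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}

	pfd.fd = fd;
	pfd.events = POLLIN | POLLOUT;	/* ERR/HUP are always reported */
	if (poll(&pfd, 1, -1) > 0)
		printf("revents: %s%s%s%s\n",
		       (pfd.revents & POLLIN) ? "IN " : "",
		       (pfd.revents & POLLOUT) ? "OUT " : "",
		       (pfd.revents & POLLERR) ? "ERR " : "",
		       (pfd.revents & POLLHUP) ? "HUP " : "");
	close(fd);
	return 0;
}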


Re: [PATCH] ipv6: make sure to initialize sockc.tsflags before first use

2017-03-21 Thread Soheil Hassas Yeganeh
On Tue, Mar 21, 2017 at 12:14 PM, Alexander Potapenko <gli...@google.com> wrote:
> In the case udp_sk(sk)->pending is AF_INET6, udpv6_sendmsg() would
> jump to do_append_data, skipping the initialization of sockc.tsflags.
> Fix the problem by moving sockc.tsflags initialization earlier.
>
> The bug was detected with KMSAN.

Nice catch and thanks for the fix! This is missing a "fixes"
attribution, added below.

> Signed-off-by: Alexander Potapenko <gli...@google.com>

Fixes: c14ac9451c34 ("sock: enable timestamping using control messages")
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>


[PATCH net 2/2] tcp: mark skbs with SCM_TIMESTAMPING_OPT_STATS

2017-03-18 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

SOF_TIMESTAMPING_OPT_STATS can be enabled and disabled
while packets are collected on the error queue.
So, checking SOF_TIMESTAMPING_OPT_STATS in sk->sk_tsflags
is not enough to safely assume that the skb contains
OPT_STATS data.

Add a bit in sock_exterr_skb to indicate whether the
skb contains opt_stats data.

Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for 
SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru...@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
---
 include/linux/errqueue.h |  2 ++
 net/core/skbuff.c| 17 +++--
 net/socket.c |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/errqueue.h b/include/linux/errqueue.h
index 9ca23fcfb5d7..6fdfc884fdeb 100644
--- a/include/linux/errqueue.h
+++ b/include/linux/errqueue.h
@@ -20,6 +20,8 @@ struct sock_exterr_skb {
struct sock_extended_err ee;
u16 addr_offset;
__be16  port;
+   u8  opt_stats:1,
+   unused:7;
 };
 
 #endif
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b1fbd1958eb6..9f781092fda9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3793,16 +3793,20 @@ EXPORT_SYMBOL(skb_clone_sk);
 
 static void __skb_complete_tx_timestamp(struct sk_buff *skb,
struct sock *sk,
-   int tstype)
+   int tstype,
+   bool opt_stats)
 {
struct sock_exterr_skb *serr;
int err;
 
+   BUILD_BUG_ON(sizeof(struct sock_exterr_skb) > sizeof(skb->cb));
+
serr = SKB_EXT_ERR(skb);
memset(serr, 0, sizeof(*serr));
serr->ee.ee_errno = ENOMSG;
serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
serr->ee.ee_info = tstype;
+   serr->opt_stats = opt_stats;
if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
serr->ee.ee_data = skb_shinfo(skb)->tskey;
if (sk->sk_protocol == IPPROTO_TCP &&
@@ -3843,7 +3847,7 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 */
if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
*skb_hwtstamps(skb) = *hwtstamps;
-   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
+   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND, false);
sock_put(sk);
}
 }
@@ -3854,7 +3858,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 struct sock *sk, int tstype)
 {
struct sk_buff *skb;
-   bool tsonly;
+   bool tsonly, opt_stats = false;
 
if (!sk)
return;
@@ -3867,9 +3871,10 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 #ifdef CONFIG_INET
if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
sk->sk_protocol == IPPROTO_TCP &&
-   sk->sk_type == SOCK_STREAM)
+   sk->sk_type == SOCK_STREAM) {
skb = tcp_get_timestamping_opt_stats(sk);
-   else
+   opt_stats = true;
+   } else
 #endif
skb = alloc_skb(0, GFP_ATOMIC);
} else {
@@ -3888,7 +3893,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
else
skb->tstamp = ktime_get_real();
 
-   __skb_complete_tx_timestamp(skb, sk, tstype);
+   __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
 
diff --git a/net/socket.c b/net/socket.c
index 692d6989d2c2..985ef06792d6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -706,7 +706,7 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 SCM_TIMESTAMPING, sizeof(tss), &tss);
 
if (skb_is_err_queue(skb) && skb->len &&
-   (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS))
+   SKB_EXT_ERR(skb)->opt_stats)
put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_OPT_STATS,
 skb->len, skb->data);
}
-- 
2.12.0.367.g23dc2f6d3c-goog
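
For context, the consumer this pair of fixes protects looks roughly like
the sketch below: it reads the error queue and only trusts control
messages explicitly tagged SCM_TIMESTAMPING_OPT_STATS. Error handling is
elided, the fallback #define is only for older userspace headers, and the
caller is assumed to pass a connected TCP socket with
SOF_TIMESTAMPING_OPT_STATS already enabled.

#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SCM_TIMESTAMPING_OPT_STATS
#define SCM_TIMESTAMPING_OPT_STATS 54	/* from asm-generic/socket.h */
#endif

static void read_opt_stats(int fd)
{
	char data[256], control[512];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		/* only trust payloads tagged as OPT_STATS */
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SCM_TIMESTAMPING_OPT_STATS)
			printf("OPT_STATS cmsg, %zu bytes of TCP_NLA_* data\n",
			       (size_t)(cm->cmsg_len - CMSG_LEN(0)));
	}
}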



[PATCH net 1/2] tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs

2017-03-18 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

__sock_recv_timestamp can be called for both normal skbs (for
receive timestamps) and for skbs on the error queue (for transmit
timestamps).

Commit 1c885808e456
(tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING)
assumes any skb passed to __sock_recv_timestamp are from
the error queue, containing OPT_STATS in the content of the skb.
This results in accessing invalid memory or generating junk
data.

To fix this, set skb->pkt_type to PACKET_OUTGOING for packets
on the error queue. This is safe because on the receive path
on local sockets skb->pkt_type is never set to PACKET_OUTGOING.
With that, copy OPT_STATS from a packet, only if its pkt_type
is PACKET_OUTGOING.

Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for 
SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru...@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
---
 net/core/skbuff.c | 10 ++
 net/socket.c  | 13 -
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cd4ba8c6b609..b1fbd1958eb6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3694,6 +3694,15 @@ static void sock_rmem_free(struct sk_buff *skb)
atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 }
 
+static void skb_set_err_queue(struct sk_buff *skb)
+{
+   /* pkt_type of skbs received on local sockets is never PACKET_OUTGOING.
+* So, it is safe to (mis)use it to mark skbs on the error queue.
+*/
+   skb->pkt_type = PACKET_OUTGOING;
+   BUILD_BUG_ON(PACKET_OUTGOING == 0);
+}
+
 /*
  * Note: We dont mem charge error packets (no sk_forward_alloc changes)
  */
@@ -3707,6 +3716,7 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
skb->sk = sk;
skb->destructor = sock_rmem_free;
atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+   skb_set_err_queue(skb);
 
/* before exiting rcu section, make sure dst is refcounted */
skb_dst_force(skb);
diff --git a/net/socket.c b/net/socket.c
index e034fe4164be..692d6989d2c2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -652,6 +652,16 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(kernel_sendmsg);
 
+static bool skb_is_err_queue(const struct sk_buff *skb)
+{
+   /* pkt_type of skbs enqueued on the error queue are set to
+* PACKET_OUTGOING in skb_set_err_queue(). This is only safe to do
+* in recvmsg, since skbs received on a local socket will never
+* have a pkt_type of PACKET_OUTGOING.
+*/
+   return skb->pkt_type == PACKET_OUTGOING;
+}
+
 /*
  * called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
  */
@@ -695,7 +705,8 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
put_cmsg(msg, SOL_SOCKET,
 SCM_TIMESTAMPING, sizeof(tss), &tss);
 
-   if (skb->len && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS))
+   if (skb_is_err_queue(skb) && skb->len &&
+   (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS))
put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_OPT_STATS,
 skb->len, skb->data);
}
-- 
2.12.0.367.g23dc2f6d3c-goog



[PATCH net-next 1/2] tcp: remove per-destination timestamp cache

2017-03-15 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
randomizes TCP timestamps per connection. After this commit,
there is no guarantee that the timestamps received from the
same destination are monotonically increasing. As a result,
the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
in struct tcp_metrics_block) is broken and cannot be relied upon.

Remove the per-destination timestamp cache and all related code
paths.

Note that this cache was already broken for caching timestamps of
multiple machines behind a NAT sharing the same address.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Cc: Lutz Vieweg <l...@5t9.de>
Cc: Florian Westphal <f...@strlen.de>
---
 include/net/tcp.h|   6 +-
 net/ipv4/tcp_input.c |   6 +-
 net/ipv4/tcp_ipv4.c  |   4 --
 net/ipv4/tcp_metrics.c   | 147 ++-
 net/ipv4/tcp_minisocks.c |  22 ++-
 net/ipv6/tcp_ipv6.c  |   5 --
 6 files changed, 11 insertions(+), 179 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index bede8f7fa742..c81f3b958d44 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -406,11 +406,7 @@ void tcp_clear_retrans(struct tcp_sock *tp);
 void tcp_update_metrics(struct sock *sk);
 void tcp_init_metrics(struct sock *sk);
 void tcp_metrics_init(void);
-bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
-   bool paws_check, bool timestamps);
-bool tcp_remember_stamp(struct sock *sk);
-bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw);
-void tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst);
+bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_disable_fack(struct tcp_sock *tp);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96b67a8b18c3..aafec0676d3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6342,8 +6342,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
dst = af_ops->route_req(sk, &fl, req, &strict);
 
if (dst && strict &&
-   !tcp_peer_is_proven(req, dst, true,
-   tmp_opt.saw_tstamp)) {
+   !tcp_peer_is_proven(req, dst)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
goto drop_and_release;
}
@@ -6352,8 +6351,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
else if (!net->ipv4.sysctl_tcp_syncookies &&
 (net->ipv4.sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
  (net->ipv4.sysctl_max_syn_backlog >> 2)) &&
-!tcp_peer_is_proven(req, dst, false,
-tmp_opt.saw_tstamp)) {
+!tcp_peer_is_proven(req, dst)) {
/* Without syncookies last quarter of
 * backlog is filled with destinations,
 * proven to be alive.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 08d870e45658..d8b401fff9fe 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -198,10 +198,6 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
tp->write_seq  = 0;
}
 
-   if (tcp_death_row->sysctl_tw_recycle &&
-   !tp->rx_opt.ts_recent_stamp && fl4->daddr == daddr)
-   tcp_fetch_timewait_stamp(sk, >dst);
-
inet->inet_dport = usin->sin_port;
sk_daddr_set(sk, daddr);
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 0f46e5fe31ad..9d0d4f39e42b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -45,8 +45,6 @@ struct tcp_metrics_block {
struct inetpeer_addrtcpm_saddr;
struct inetpeer_addrtcpm_daddr;
unsigned long   tcpm_stamp;
-   u32 tcpm_ts;
-   u32 tcpm_ts_stamp;
u32 tcpm_lock;
u32 tcpm_vals[TCP_METRIC_MAX_KERNEL + 1];
struct tcp_fastopen_metrics tcpm_fastopen;
@@ -123,8 +121,6 @@ static void tcpm_suck_dst(struct tcp_metrics_block *tm,
tm->tcpm_vals[TCP_METRIC_SSTHRESH] = dst_metric_raw(dst, RTAX_SSTHRESH);
tm->tcpm_vals[TCP_METRIC_CWND] = dst_metric_raw(dst, RTAX_CWND)

[PATCH net-next 2/2] tcp: remove tcp_tw_recycle

2017-03-15 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.

After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine is not monotonically increasing,
anymore.

Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Yuchung Cheng <ych...@google.com>
Cc: Lutz Vieweg <l...@5t9.de>
Cc: Florian Westphal <f...@strlen.de>
---
 Documentation/networking/ip-sysctl.txt |  5 -
 include/net/netns/ipv4.h   |  1 -
 include/net/tcp.h  |  3 +--
 include/uapi/linux/snmp.h  |  1 -
 net/ipv4/proc.c|  1 -
 net/ipv4/sysctl_net_ipv4.c |  7 ---
 net/ipv4/tcp_input.c   | 30 +-
 net/ipv4/tcp_ipv4.c| 15 ++-
 net/ipv6/tcp_ipv6.c|  5 +
 9 files changed, 9 insertions(+), 59 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ab0230461377..ed3d0791eb27 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -640,11 +640,6 @@ tcp_tso_win_divisor - INTEGER
building larger TSO frames.
Default: 3
 
-tcp_tw_recycle - BOOLEAN
-   Enable fast recycling TIME-WAIT sockets. Default value is 0.
-   It should not be changed without advice/request of technical
-   experts.
-
 tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 622d2da27135..2e9d649ba169 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -33,7 +33,6 @@ struct inet_timewait_death_row {
atomic_t tw_count;
 
struct inet_hashinfo *hashinfo ____cacheline_aligned_in_smp;
-   int sysctl_tw_recycle;
int sysctl_max_tw_buckets;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c81f3b958d44..e614ad4d613e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1810,8 +1810,7 @@ struct tcp_request_sock_ops {
 __u16 *mss);
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-  const struct request_sock *req,
-  bool *strict);
+  const struct request_sock *req);
__u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
   struct flowi *fl, struct request_sock *req,
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 3b2bed7ca9a4..cec0e171d20c 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -177,7 +177,6 @@ enum
LINUX_MIB_TIMEWAITED,   /* TimeWaited */
LINUX_MIB_TIMEWAITRECYCLED, /* TimeWaitRecycled */
LINUX_MIB_TIMEWAITKILLED,   /* TimeWaitKilled */
-   LINUX_MIB_PAWSPASSIVEREJECTED,  /* PAWSPassiveRejected */
LINUX_MIB_PAWSACTIVEREJECTED,   /* PAWSActiveRejected */
LINUX_MIB_PAWSESTABREJECTED,/* PAWSEstabRejected */
LINUX_MIB_DELAYEDACKS,  /* DelayedACKs */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 69cf49e8356d..4ccbf464d1ac 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -199,7 +199,6 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TW", LINUX_MIB_TIMEWAITED),
SNMP_MIB_ITEM("TWRecycled", LINUX_MIB_TIMEWAITRECYCLED),
SNMP_MIB_ITEM("TWKilled", LINUX_MIB_TIMEWAITKILLED),
-   SNMP_MIB_ITEM("PAWSPassive", LINUX_MIB_PAWSPASSIVEREJECTED),
SNMP_MIB_ITEM("PAWSActive", LINUX_MIB_PAWSACTIVEREJECTED),
SNMP_MIB_ITEM("PAWSEstab", LINUX_MIB_PAWSESTABREJECTED),
SNMP_MIB_ITEM("DelayedACKs", LINUX_MIB_DELAYEDACKS),
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d6880a6149ee..1

Re: [PATCH] Enable tx timestamping on loopback and dummy

2017-03-11 Thread Soheil Hassas Yeganeh
On Sat, Mar 11, 2017 at 9:42 AM, Ezequiel Lara Gomez wrote:
> Also, cleanup some warnings from timestamping code.

Can you please submit the styling fixes as a separate patch?

Thanks,
Soheil


Re: [PATCH net 1/2] net: fix socket refcounting in skb_complete_wifi_ack()

2017-03-03 Thread Soheil Hassas Yeganeh
On Fri, Mar 3, 2017 at 7:01 PM, Eric Dumazet <eduma...@google.com> wrote:
> TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
> By the time TX completion happens, sk_refcnt might be already 0.
>
> sock_hold()/sock_put() would then corrupt critical state, like
> sk_wmem_alloc.
>
> Fixes: bf7fa551e0ce ("mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi 
> ack path")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Alexander Duyck <alexander.h.du...@intel.com>
> Cc: Johannes Berg <johan...@sipsolutions.net>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Willem de Bruijn <will...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/core/skbuff.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f3557958e9bf147631a90b51fef0630920acd97b..e2f37a560ec43ccf60a71f190423bd265eccf594 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3893,7 +3893,7 @@ void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
>  {
> struct sock *sk = skb->sk;
> struct sock_exterr_skb *serr;
> -   int err;
> +   int err = 1;
>
> skb->wifi_acked_valid = 1;
> skb->wifi_acked = acked;
> @@ -3903,14 +3903,15 @@ void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
> serr->ee.ee_errno = ENOMSG;
> serr->ee.ee_origin = SO_EE_ORIGIN_TXSTATUS;
>
> -   /* take a reference to prevent skb_orphan() from freeing the socket */
> -   sock_hold(sk);
> -
> -   err = sock_queue_err_skb(sk, skb);
> +   /* Take a reference to prevent skb_orphan() from freeing the socket,
> +* but only if the socket refcount is not zero.
> +*/
> +   if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
> +   err = sock_queue_err_skb(sk, skb);
> +   sock_put(sk);
> +   }
> if (err)
> kfree_skb(skb);
> -
> -   sock_put(sk);
>  }
>  EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
>
> --
> 2.12.0.rc1.440.g5b76565f74-goog
>

Nice fix!


Re: [PATCH net 2/2] net: fix socket refcounting in skb_complete_tx_timestamp()

2017-03-03 Thread Soheil Hassas Yeganeh
On Fri, Mar 3, 2017 at 7:01 PM, Eric Dumazet <eduma...@google.com> wrote:
>
> TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
> By the time TX completion happens, sk_refcnt might be already 0.
>
> sock_hold()/sock_put() would then corrupt critical state, like
> sk_wmem_alloc and lead to leaks or use after free.
>
> Fixes: 62bccb8cdb69 ("net-timestamp: Make the clone operation stand-alone 
> from phy timestamping")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Alexander Duyck <alexander.h.du...@intel.com>
> Cc: Johannes Berg <johan...@sipsolutions.net>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Willem de Bruijn <will...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/core/skbuff.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e2f37a560ec43ccf60a71f190423bd265eccf594..cd4ba8c6b6091651403cf74de8c60ccf69aa3e7b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3828,13 +3828,14 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> if (!skb_may_tx_timestamp(sk, false))
> return;
>
> -   /* take a reference to prevent skb_orphan() from freeing the socket */
> -   sock_hold(sk);
> -
> -   *skb_hwtstamps(skb) = *hwtstamps;
> -   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
> -
> -   sock_put(sk);
> +   /* Take a reference to prevent skb_orphan() from freeing the socket,
> +* but only if the socket refcount is not zero.
> +*/
> +   if (likely(atomic_inc_not_zero(&sk->sk_refcnt))) {
> +   *skb_hwtstamps(skb) = *hwtstamps;
> +   __skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
> +   sock_put(sk);
> +   }
>  }
>  EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
>
> --
> 2.12.0.rc1.440.g5b76565f74-goog
>

Thanks for the fix!
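
Both fixes rely on the same idiom: take a reference only if the refcount
has not already dropped to zero. A user-space C11 rendition of the idiom
follows -- illustrative only, not the kernel's atomic_inc_not_zero()
implementation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static bool refcount_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* bump the count only if it is still non-zero; on failure
		 * the CAS reloads old and we retry */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

int main(void)
{
	atomic_int live = 1, dead = 0;

	printf("live: %d\n", refcount_inc_not_zero(&live));	/* 1 */
	printf("dead: %d\n", refcount_inc_not_zero(&dead));	/* 0 */
	return 0;
}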


Re: [PATCH net] tcp/dccp: block BH for SYN processing

2017-03-01 Thread Soheil Hassas Yeganeh
>
>  *** DEADLOCK ***
>
> 1 lock held by syz-executor0/5090:
>  #0:  (sk_lock-AF_INET6){+.+.+.}, at: [] lock_sock
> include/net/sock.h:1460 [inline]
>  #0:  (sk_lock-AF_INET6){+.+.+.}, at: []
> sock_setsockopt+0x233/0x1e40 net/core/sock.c:683
>
> stack backtrace:
> CPU: 1 PID: 5090 Comm: syz-executor0 Not tainted 4.10.0+ #60
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:15 [inline]
>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>  print_usage_bug+0x3ef/0x450 kernel/locking/lockdep.c:2387
>  valid_state kernel/locking/lockdep.c:2400 [inline]
>  mark_lock_irq kernel/locking/lockdep.c:2602 [inline]
>  mark_lock+0xf30/0x1410 kernel/locking/lockdep.c:3065
>  mark_irqflags kernel/locking/lockdep.c:2941 [inline]
>  __lock_acquire+0x6dc/0x3270 kernel/locking/lockdep.c:3295
>  lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
>  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
>  _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
>  spin_lock include/linux/spinlock.h:299 [inline]
>  inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
>  reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
>  inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
>  dccp_v6_conn_request+0xada/0x11b0 net/dccp/ipv6.c:380
>  dccp_rcv_state_process+0x51e/0x1660 net/dccp/input.c:606
>  dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
>  sk_backlog_rcv include/net/sock.h:896 [inline]
>  __release_sock+0x127/0x3a0 net/core/sock.c:2052
>  release_sock+0xa5/0x2b0 net/core/sock.c:2539
>  sock_setsockopt+0x60f/0x1e40 net/core/sock.c:1016
>  SYSC_setsockopt net/socket.c:1782 [inline]
>  SyS_setsockopt+0x2fb/0x3a0 net/socket.c:1765
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: 0033:0x4458b9
> RSP: 002b:7fe8b26c2b58 EFLAGS: 0292 ORIG_RAX: 0036
> RAX: ffda RBX: 0006 RCX: 0000004458b9
> RDX: 001a RSI: 0001 RDI: 0006
> RBP: 006e2110 R08: 0010 R09: 
> R10: 208c3000 R11: 0292 R12: 00708000
> R13: 2000 R14: 1000 R15: 
>
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: Andrey Konovalov <andreyk...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/dccp/input.c |   10 --
>  net/ipv4/tcp_input.c |   10 --
>  2 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/net/dccp/input.c b/net/dccp/input.c
> index 8fedc2d497709b3dea9202894f45bf5cab043361..4a05d78768502df69275b4f91cb03bb2ada9f4c3 100644
> --- a/net/dccp/input.c
> +++ b/net/dccp/input.c
> @@ -577,6 +577,7 @@ int dccp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
> struct dccp_sock *dp = dccp_sk(sk);
> struct dccp_skb_cb *dcb = DCCP_SKB_CB(skb);
> const int old_state = sk->sk_state;
> +   bool acceptable;
> int queued = 0;
>
> /*
> @@ -603,8 +604,13 @@ int dccp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
>  */
> if (sk->sk_state == DCCP_LISTEN) {
> if (dh->dccph_type == DCCP_PKT_REQUEST) {
> -   if (inet_csk(sk)->icsk_af_ops->conn_request(sk,
> -   skb) < 0)
> +   /* It is possible that we process SYN packets from backlog,
> +* so we need to make sure to disable BH right there.
> +*/
> +   local_bh_disable();
> +   acceptable = inet_csk(sk)->icsk_af_ops->conn_request(sk, skb) >= 0;
> +   local_bh_enable();
> +   if (!acceptable)
> return 1;
> consume_skb(skb);
> return 0;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> 2c0ff327b6dfe6919f22bf52687816e19c2c0444..39c393cc0fd3c17130cd5d8d8b37f31ad3aeafd9
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5886,9 +5886,15 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> if (th->syn) {
> if (th->fin)
> goto discard;
> -   if (icsk->icsk_af_ops->conn_request(sk, skb) < 0)
> -   return 1;
> +   /* It is possible that we process SYN packets from backlog,
> +* so we need to make sure to disable BH right there.
> +*/
> +   local_bh_disable();
> +   acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;
> +   local_bh_enable();
>
> +   if (!acceptable)
> +   return 1;
> consume_skb(skb);
> return 0;
> }
>
>

Nice fix! Thanks Eric!


Re: [BUG] 4.10-rc8 - ping spinning?

2017-02-16 Thread Soheil Hassas Yeganeh
On Thu, Feb 16, 2017 at 11:08 AM,  <l...@pengaru.com> wrote:
> On Thu, Feb 16, 2017 at 10:52:19AM -0500, Soheil Hassas Yeganeh wrote:
>> On Thu, Feb 16, 2017 at 10:50 AM, Soheil Hassas Yeganeh
>> <soh...@google.com> wrote:
>> > Thank you Vito for the report.
>> >
>> > The patch you cited actually resolves a similar backward compatibility
>> > problem for traceroute.
>> >
>> > I suspect the problem here is that there's a local error queued on the
>> > error queue after an ICMP message. ping apparently expects
>> > sk->sk_err to be set for local errors as well, and hence the
>> > error. Ideally, ping should read the error queue if there is an EPOLLERR,
>> > because local errors never set sk->sk_err on their own. That is, if we
>> > have
>>
>> [oops] That is, if we have only one local error on the error queue, we
>> cannot rely on having an error on recvmsg (i.e., sk->sk_err being set)
>> even in 4.9.
>>
>> 
>
> Hi Soheil,
>
> This doesn't appear to be trivially reproducible here by just running ping
> as it was originally discovered.  I'll see if I can reliably cause the
> malfunction somehow, but until then I can't meaningfully test patches.
>
> Perhaps a form of fault injection would make more sense if there's a
> reasonable idea of what this is stemming from?

I tried to generate different ICMP errors as well as local errors, but
unfortunately haven't been able to reproduce the problem.

> I've opened an issue with iputils on github in the event that this is found
> to be a ping bug.  Your input might be helpful there as well:
> https://github.com/iputils/iputils/issues/74

Sent a pull request, although we might want to at least confirm that the
userspace patch fixes the issue in ping.

Thanks!
Soheil

> Thanks,
> Vito Caputo
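
For anyone hitting the same busy loop: the suggested application-side fix
is to drain the error queue on EPOLLERR rather than waiting for recvmsg()
to fail with sk_err. A minimal sketch, assuming IP_RECVERR was enabled on
the socket and with error handling elided:

#include <linux/errqueue.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>

static void drain_error_queue(int fd)
{
	char buf[512], control[512];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

	for (;;) {
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = control,
			.msg_controllen = sizeof(control),
		};
		struct cmsghdr *cm;

		if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
			break;	/* EAGAIN: nothing left to drain */
		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			if (cm->cmsg_level == SOL_IP &&
			    cm->cmsg_type == IP_RECVERR) {
				struct sock_extended_err *ee =
					(struct sock_extended_err *)CMSG_DATA(cm);

				printf("origin %u, errno %u\n",
				       ee->ee_origin, ee->ee_errno);
			}
		}
	}
}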


Re: [BUG] 4.10-rc8 - ping spinning?

2017-02-16 Thread Soheil Hassas Yeganeh
On Thu, Feb 16, 2017 at 10:50 AM, Soheil Hassas Yeganeh
<soh...@google.com> wrote:
> Thank you Vito for the report.
>
> The patch you cited actually resolves a similar backward compatibility
> problem for traceroute.
>
> I suspect the problem here is that there's a local error queued on the
> error queue after an ICMP message. ping apparently expects
> sk->sk_err to be set for local errors as well, and hence the
> error. Ideally, ping should read the error queue if there is an EPOLLERR,
> because local errors never set sk->sk_err on their own. That is, if we
> have

[oops] That is, if we have only one local error on the error queue, we
cannot rely on having an error on recvmsg (i.e., sk->sk_err being set)
even in 4.9.

Thanks,
Soheil

> But as a workaround, would you mind trying the following patch to see
> if it resolves the issue reported?
>
> From: Soheil Hassas Yeganeh <soh...@google.com>
> Date: Thu, 16 Feb 2017 10:48:24 -0500
> ---
>  net/core/skbuff.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 734c71468b01..2b774b564024 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3717,7 +3717,8 @@ EXPORT_SYMBOL(sock_queue_err_skb);
>  static bool is_icmp_err_skb(const struct sk_buff *skb)
>  {
>   return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
> -   SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
> +   SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6 ||
> +   SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_LOCAL);
>  }
>
>  struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
> --
>
> Thanks!
> Soheil
>
> On Thu, Feb 16, 2017 at 7:05 AM,  <l...@pengaru.com> wrote:
>> Hello netdev,
>>
>> Please see the forwarded message below.  This was sent to linux-kernel but
>> after digging a little I suspect it's specific to the network stack.
>>
>> Perusing the net/ changes between 4.9 and 4.10-rc8, this sounded awfully related
>> to what I'm observing:
>>
>> commit 83a1a1a70e87f676fbb6086b26b6ac7f7fdd107d
>> Author: Soheil Hassas Yeganeh <soh...@google.com>
>> Date:   Wed Nov 30 14:01:08 2016 -0500
>>
>> sock: reset sk_err for ICMP packets read from error queue
>>
>> Only when ICMP packets are enqueued onto the error queue,
>> sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
>> in sock_dequeue_err_skb), a subsequent error queue read
>> would set sk_err to the next error on the queue, or 0 if empty.
>> As no error types other than ICMP set this field, sk_err should
>> not be modified upon dequeuing them.
>>
>> Only for ICMP errors, reset the (racy) sk_err. Some applications,
>> like traceroute, rely on it and go into a futile busy POLLERR
>> loop otherwise.
>>
>> In principle, sk_err has to be set while an ICMP error is queued.
>> Testing is_icmp_err_skb(skb_next) approximates this without
>> requiring a full queue walk. Applications that receive both ICMP
>> and other errors cannot rely on this legacy behavior, as other
>> errors do not set sk_err in the first place.
>>
>> Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
>> Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
>> Signed-off-by: Willem de Bruijn <will...@google.com>
>> Acked-by: Eric Dumazet <eduma...@google.com>
>> Acked-by: Maciej Żenczykowski <m...@google.com>
>> Signed-off-by: David S. Miller <da...@davemloft.net>
>>
>> Cheers,
>> Vito Caputo
>>
>>
>> - Forwarded message from l...@pengaru.com -
>>
>> Date: Thu, 16 Feb 2017 03:17:49 -0800
>> From: l...@pengaru.com
>> To: linux-ker...@vger.kernel.org
>> Subject: Re: [BUG] 4.10-rc8 - ping spinning?
>> User-Agent: Mutt/1.5.23 (2014-03-12)
>>
>> On Thu, Feb 16, 2017 at 03:03:03AM -0800, l...@pengaru.com wrote:
>>> Hello list,
>>>
>>> Some rtl8192cu bugs of old got me in the habit of running ping in a shelved 
>>> (i.e. forgotten) xterm, a harmless practice which seemed to prevent the 
>>> rtl8192cu device from dying.
>>>
>>> This evening the system started getting very slow and to my surprise I found
>>> this in `top`:
>>>  5115 swivel30  10   14772   1928   1756 R  90.9  0.0   1351:41 ping
>>>  9005 swivel30  10   14772   1892   1724 R  90.9  0.0   1354:26 ping
>>>
>>> This is a dual core machine (X61s, core2

Re: [BUG] 4.10-rc8 - ping spinning?

2017-02-16 Thread Soheil Hassas Yeganeh
Thank you Vito for the report.

The patch you cited actually resolves a similar backward compatibility
problem for traceroute.

I suspect the problem here is that there's a local error queued on the
error queue after an ICMP message. ping apparently expects
sk->sk_err to be set for local errors as well, and hence the
error. Ideally, ping should read the error queue if there is an EPOLLERR,
because local errors never set sk->sk_err on their own. That is, if we
have

But as a workaround, would you mind trying the following patch to see
if it resolves the issue reported?

From: Soheil Hassas Yeganeh <soh...@google.com>
Date: Thu, 16 Feb 2017 10:48:24 -0500
---
 net/core/skbuff.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 734c71468b01..2b774b564024 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3717,7 +3717,8 @@ EXPORT_SYMBOL(sock_queue_err_skb);
 static bool is_icmp_err_skb(const struct sk_buff *skb)
 {
  return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
-   SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
+   SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6 ||
+   SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_LOCAL);
 }

 struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
-- 

Thanks!
Soheil

On Thu, Feb 16, 2017 at 7:05 AM,  <l...@pengaru.com> wrote:
> Hello netdev,
>
> Please see the forwarded message below.  This was sent to linux-kernel but
> after digging a little I suspect it's specific to the network stack.
>
> Perusing the net/ changes between 4.9 and 4.10-rc8, this sounded awfully related
> to what I'm observing:
>
> commit 83a1a1a70e87f676fbb6086b26b6ac7f7fdd107d
> Author: Soheil Hassas Yeganeh <soh...@google.com>
> Date:   Wed Nov 30 14:01:08 2016 -0500
>
> sock: reset sk_err for ICMP packets read from error queue
>
> Only when ICMP packets are enqueued onto the error queue,
> sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
> in sock_dequeue_err_skb), a subsequent error queue read
> would set sk_err to the next error on the queue, or 0 if empty.
> As no error types other than ICMP set this field, sk_err should
> not be modified upon dequeuing them.
>
> Only for ICMP errors, reset the (racy) sk_err. Some applications,
> like traceroute, rely on it and go into a futile busy POLLERR
> loop otherwise.
>
> In principle, sk_err has to be set while an ICMP error is queued.
> Testing is_icmp_err_skb(skb_next) approximates this without
> requiring a full queue walk. Applications that receive both ICMP
> and other errors cannot rely on this legacy behavior, as other
> errors do not set sk_err in the first place.
>
> Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
> Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
> Signed-off-by: Willem de Bruijn <will...@google.com>
> Acked-by: Eric Dumazet <eduma...@google.com>
> Acked-by: Maciej Żenczykowski <m...@google.com>
> Signed-off-by: David S. Miller <da...@davemloft.net>
>
> Cheers,
> Vito Caputo
>
>
> - Forwarded message from l...@pengaru.com -
>
> Date: Thu, 16 Feb 2017 03:17:49 -0800
> From: l...@pengaru.com
> To: linux-ker...@vger.kernel.org
> Subject: Re: [BUG] 4.10-rc8 - ping spinning?
> User-Agent: Mutt/1.5.23 (2014-03-12)
>
> On Thu, Feb 16, 2017 at 03:03:03AM -0800, l...@pengaru.com wrote:
>> Hello list,
>>
>> Some rtl8192cu bugs of old got me in the habit of running ping in a shelved
>> (i.e. forgotten) xterm, a harmless practice which seemed to prevent the
>> rtl8192cu device from dying.
>>
>> This evening the system started getting very slow and to my surprise I found
>> this in `top`:
>>  5115 swivel30  10   14772   1928   1756 R  90.9  0.0   1351:41 ping
>>  9005 swivel30  10   14772   1892   1724 R  90.9  0.0   1354:26 ping
>>
>> This is a dual core machine (X61s, core2duo 1.8Ghz), those processes are
>> burning all the free CPU in the system context.  They're identical commands,
>> just plain `ping domain.com`, to the same host.  It appears I accidentally
>> (fortuitously?) had two running, which made this event more interesting.
>>
>> I can assert that these did not begin spinning simultaneously - as you can see
>> by the cumulative time in `top` there's a small delta.  I also use a window
>> manager with builtin continuous process monitoring, and when I noticed this was
>> happening I was able to see that one of the processes had only recently begun
>> spinning, the other was spinning long eno

Re: Extending socket timestamping API for NTP

2017-02-08 Thread Soheil Hassas Yeganeh
On Tue, Feb 7, 2017 at 2:32 PM, Willem de Bruijn wrote:
>>> 2) new SO_TIMESTAMPING option to receive from the error queue only
>>>user data as was passed to sendmsg() instead of Ethernet frames
>>>
>>>Parsing Ethernet and IP headers (especially IPv6 options) is not
>>>fun and SOF_TIMESTAMPING_OPT_ID is not always practical, e.g. in
>>>applications which process messages from the error queue
>>>asynchronously and don't bind/connect their sockets.
>>
>> This would be useful for application writing.
>
> What kind of user data are you suggesting? Just a user-defined ID
> passed as a cmsg? Allowing such metadata to override
> skb_shinfo(skb)->tskey sounds fine.

Yes, exactly. Just a user-defined ID that overrides the
skb_shinfo(skb)->tskey.

Thanks,
Soheil


Re: Extending socket timestamping API for NTP

2017-02-07 Thread Soheil Hassas Yeganeh
On Tue, Feb 7, 2017 at 6:01 AM, Miroslav Lichvar  wrote:
> 2) new SO_TIMESTAMPING option to receive from the error queue only
>user data as was passed to sendmsg() instead of Ethernet frames
>
>Parsing Ethernet and IP headers (especially IPv6 options) is not
>fun and SOF_TIMESTAMPING_OPT_ID is not always practical, e.g. in
>applications which process messages from the error queue
>asynchronously and don't bind/connect their sockets.

This is going to be quite useful. However, I'm not sure if sending
back the original packet would be a proper API. Instead, one option is
to add a control message, so that applications can set the OPT_ID for
the timestamp. Perhaps something like this, from the user's perspective:

cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type  = SCM_TIMESTAMPING_OPT_ID;
cmsg->cmsg_len   = CMSG_LEN(sizeof(__u32));
*((__u32 *) CMSG_DATA(cmsg)) = my_id;

Thanks,
Soheil
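
Fleshing the snippet above out into a full sendmsg() call may help make
the proposal concrete. Note that SCM_TIMESTAMPING_OPT_ID is only a
constant proposed in this thread: the value below is a placeholder, no
such UAPI exists, and this sketch does not work against any shipped
kernel.

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define SCM_TIMESTAMPING_OPT_ID 0x7f	/* hypothetical placeholder */

static ssize_t send_with_tskey(int fd, const void *data, size_t len,
			       unsigned int my_id)
{
	char control[CMSG_SPACE(sizeof(unsigned int))];
	struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};
	struct cmsghdr *cmsg;

	memset(control, 0, sizeof(control));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TIMESTAMPING_OPT_ID;	/* proposed */
	cmsg->cmsg_len = CMSG_LEN(sizeof(unsigned int));
	memcpy(CMSG_DATA(cmsg), &my_id, sizeof(my_id));

	/* under the proposal, the kernel would record my_id as
	 * skb_shinfo(skb)->tskey and echo it back in ee_data on the
	 * matching timestamp */
	return sendmsg(fd, &msg, 0);
}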


[PATCH net-next v2] tcp: provide timestamps for partial writes

2017-01-04 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

For TCP sockets, TX timestamps are only captured when the user data
is successfully and fully written to the socket. In many cases,
however, TCP writes can be partial for which no timestamp is
collected.

Collect timestamps whenever any user data is (fully or partially)
copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
instead of the local skb pointer since it can be set to NULL on
the error path.

Note that tcp_write_queue_tail can be NULL, even if bytes have been
copied to the socket. This is because acknowledgements are being
processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
called tcp_write_queue_tail can be NULL. For such cases, this patch
does not collect any timestamps (i.e., it is best-effort).

This patch is written with suggestions from Willem de Bruijn and
Eric Dumazet.

Change-log V1 -> V2:
- Use sockc.tsflags instead of sk->sk_tsflags.
- Use the same code path for normal writes and errors.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Cc: Willem de Bruijn <will...@google.com>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Neal Cardwell <ncardw...@google.com>
Cc: Martin KaFai Lau <ka...@fb.com>
---
 net/ipv4/tcp.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2e3807d8eba8..ec97e4b4a62f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -429,7 +429,7 @@ EXPORT_SYMBOL(tcp_init_sock);
 
 static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
 {
-   if (tsflags) {
+   if (tsflags && skb) {
struct skb_shared_info *shinfo = skb_shinfo(skb);
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
@@ -958,10 +958,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
copied += copy;
offset += copy;
size -= copy;
-   if (!size) {
-   tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+   if (!size)
goto out;
-   }
 
if (skb->len < size_goal || (flags & MSG_OOB))
continue;
@@ -987,8 +985,11 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
}
 
 out:
-   if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
-   tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
+   if (copied) {
+   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
+   if (!(flags & MSG_SENDPAGE_NOTLAST))
+   tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
+   }
return copied;
 
 do_error:
@@ -1281,7 +1282,6 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
copied += copy;
if (!msg_data_left(msg)) {
-   tcp_tx_timestamp(sk, sockc.tsflags, skb);
if (unlikely(flags & MSG_EOR))
TCP_SKB_CB(skb)->eor = 1;
goto out;
@@ -1312,8 +1312,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
 
 out:
-   if (copied)
+   if (copied) {
+   tcp_tx_timestamp(sk, sockc.tsflags, tcp_write_queue_tail(sk));
tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
+   }
 out_nopush:
release_sock(sk);
return copied + copied_syn;
-- 
2.11.0.390.gc69c2f50cf-goog
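
For reference, the sender-side setup that makes tcp_tx_timestamp() matter
is standard SO_TIMESTAMPING usage, nothing specific to this patch; a
sketch follows. Timestamps are then read back from the error queue with
recvmsg(fd, ..., MSG_ERRQUEUE) as SCM_TIMESTAMPING control messages.

#include <linux/net_tstamp.h>
#include <sys/socket.h>

static int enable_tx_timestamps(int fd)
{
	unsigned int flags = SOF_TIMESTAMPING_TX_ACK |	/* stamp when acked */
			     SOF_TIMESTAMPING_SOFTWARE |
			     SOF_TIMESTAMPING_OPT_ID;	/* key by byte offset */

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}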



Re: [PATCH] tcp: provide tx timestamps for partial writes

2017-01-04 Thread Soheil Hassas Yeganeh
On Wed, Jan 4, 2017 at 7:55 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
> On Tue, 2017-01-03 at 10:22 -0500, Soheil Hassas Yeganeh wrote:
> > On Mon, Jan 2, 2017 at 3:23 PM, Soheil Hassas Yeganeh <soh...@google.com> 
> > wrote:
> > > On Mon, Jan 2, 2017 at 3:20 PM, Soheil Hassas Yeganeh
> > > <soheil.k...@gmail.com> wrote:
> > >> From: Soheil Hassas Yeganeh <soh...@google.com>
> > >>
> > >> For TCP sockets, tx timestamps are only captured when the user data
> > >> is successfully and fully written to the socket. In many cases,
> > >> however, TCP writes can be partial for which no timestamp is
> > >> collected.
> > >>
> > >> Collect timestamps when the user data is partially copied into
> > >> the socket.
> > >>
> > >> Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
> > >> Cc: Willem de Bruijn <will...@google.com>
> > >> Cc: Yuchung Cheng <ych...@google.com>
> > >> Cc: Eric Dumazet <eduma...@google.com>
> > >> Cc: Neal Cardwell <ncardw...@google.com>
> > >> Cc: Martin KaFai Lau <ka...@fb.com>
> > >> ---
> > >>  net/ipv4/tcp.c | 8 ++--
> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > >> index 2e3807d..c207b16 100644
> > >> --- a/net/ipv4/tcp.c
> > >> +++ b/net/ipv4/tcp.c
> > >> @@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, 
> > >> struct page *page, int offset,
> > >> return copied;
> > >>
> > >>  do_error:
> > >> -   if (copied)
> > >> +   if (copied) {
> > >> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
> > >> tcp_write_queue_tail(sk));
> > >> goto out;
> > >> +   }
> > >>  out_err:
> > >> /* make sure we wake any epoll edge trigger waiter */
> > >> if (unlikely(skb_queue_len(>sk_write_queue) == 0 &&
> > >> @@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr 
> > >> *msg, size_t size)
> > >> }
> > >>
> > >>  do_error:
> > >> -   if (copied + copied_syn)
> > >> +   if (copied + copied_syn) {
> > >> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
> > >> tcp_write_queue_tail(sk));
> >
> > Thanks to Willem for noting that this should be sockc.tsflags and not
> > sk->sk_tsflags. I'll send V2 to fix.
>
> Also, why not factorize a bit and have a single point calling
> tcp_tx_timestamp()?
>
> This would ease code review quite a bit.

Thanks Eric! will do in V2.

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4a044964da6670829e5c47fef52d2cd76360b59f..11357f3bd1f82fa29129dd3ecf4d270feb4a6b1d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -958,10 +958,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
> page *page, int offset,
> copied += copy;
> offset += copy;
> size -= copy;
> -   if (!size) {
> -   tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
> +   if (!size)
> goto out;
> -   }
>
> if (skb->len < size_goal || (flags & MSG_OOB))
> continue;
> @@ -987,8 +985,11 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
> page *page, int offset,
> }
>
>  out:
> -   if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> -   tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
> +   if (copied) {
> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
> tcp_write_queue_tail(sk));
> +   if (!(flags & MSG_SENDPAGE_NOTLAST))
> +   tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
> +   }
> return copied;
>
>  do_error:
> @@ -1281,7 +1282,6 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
> size_t size)
>
> copied += copy;
> if (!msg_data_left(msg)) {
> -   tcp_tx_timestamp(sk, sockc.tsflags, skb);
> if (unlikely(flags & MSG_EOR))
> TCP_SKB_CB(skb)->eor = 1;
> goto out;
> @@ -1312,8 +1312,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
> size_t size)
> }
>
>  out:
> -   if (copied)
> +   if (copied) {
> +   tcp_tx_timestamp(sk, sockc.tsflags, tcp_write_queue_tail(sk));
> tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
> +   }
>  out_nopush:
> release_sock(sk);
> return copied + copied_syn;
>
>
>


Re: [PATCH] tcp: provide tx timestamps for partial writes

2017-01-03 Thread Soheil Hassas Yeganeh
On Mon, Jan 2, 2017 at 3:23 PM, Soheil Hassas Yeganeh <soh...@google.com> wrote:
> On Mon, Jan 2, 2017 at 3:20 PM, Soheil Hassas Yeganeh
> <soheil.k...@gmail.com> wrote:
>> From: Soheil Hassas Yeganeh <soh...@google.com>
>>
>> For TCP sockets, tx timestamps are only captured when the user data
>> is successfully and fully written to the socket. In many cases,
>> however, TCP writes can be partial for which no timestamp is
>> collected.
>>
>> Collect timestamps when the user data is partially copied into
>> the socket.
>>
>> Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
>> Cc: Willem de Bruijn <will...@google.com>
>> Cc: Yuchung Cheng <ych...@google.com>
>> Cc: Eric Dumazet <eduma...@google.com>
>> Cc: Neal Cardwell <ncardw...@google.com>
>> Cc: Martin KaFai Lau <ka...@fb.com>
>> ---
>>  net/ipv4/tcp.c | 8 ++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 2e3807d..c207b16 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
>> page *page, int offset,
>> return copied;
>>
>>  do_error:
>> -   if (copied)
>> +   if (copied) {
>> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
>> tcp_write_queue_tail(sk));
>> goto out;
>> +   }
>>  out_err:
>> /* make sure we wake any epoll edge trigger waiter */
>> if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
>> @@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
>> size_t size)
>> }
>>
>>  do_error:
>> -   if (copied + copied_syn)
>> +   if (copied + copied_syn) {
>> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
>> tcp_write_queue_tail(sk));

Thanks to Willem for noting that this should be sockc.tsflags and not
sk->sk_tsflags. I'll send V2 to fix.

Soheil

>> goto out;
>> +   }
>>  out_err:
>> err = sk_stream_error(sk, flags, err);
>> /* make sure we wake any epoll edge trigger waiter */
>> --
>> 2.8.0.rc3.226.g39d4020
>>
>
> I'm sorry for the incomplete annotation. This is for [net-next].
>
> Thanks,
> Soheil


Re: [PATCH] tcp: provide tx timestamps for partial writes

2017-01-02 Thread Soheil Hassas Yeganeh
On Mon, Jan 2, 2017 at 3:20 PM, Soheil Hassas Yeganeh
<soheil.k...@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soh...@google.com>
>
> For TCP sockets, tx timestamps are only captured when the user data
> is successfully and fully written to the socket. In many cases,
> however, TCP writes can be partial for which no timestamp is
> collected.
>
> Collect timestamps when the user data is partially copied into
> the socket.
>
> Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: Willem de Bruijn <will...@google.com>
> Cc: Yuchung Cheng <ych...@google.com>
> Cc: Eric Dumazet <eduma...@google.com>
> Cc: Neal Cardwell <ncardw...@google.com>
> Cc: Martin KaFai Lau <ka...@fb.com>
> ---
>  net/ipv4/tcp.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 2e3807d..c207b16 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
> page *page, int offset,
> return copied;
>
>  do_error:
> -   if (copied)
> +   if (copied) {
> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
> tcp_write_queue_tail(sk));
> goto out;
> +   }
>  out_err:
> /* make sure we wake any epoll edge trigger waiter */
> if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
> @@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
> size_t size)
> }
>
>  do_error:
> -   if (copied + copied_syn)
> +   if (copied + copied_syn) {
> +   tcp_tx_timestamp(sk, sk->sk_tsflags, 
> tcp_write_queue_tail(sk));
> goto out;
> +   }
>  out_err:
> err = sk_stream_error(sk, flags, err);
> /* make sure we wake any epoll edge trigger waiter */
> --
> 2.8.0.rc3.226.g39d4020
>

I'm sorry for the incomplete annotation. This is for [net-next].

Thanks,
Soheil


[PATCH] tcp: provide tx timestamps for partial writes

2017-01-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

For TCP sockets, tx timestamps are only captured when the user data
is successfully and fully written to the socket. In many cases,
however, TCP writes can be partial for which no timestamp is
collected.

Collect timestamps when the user data is partially copied into
the socket.

Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Cc: Willem de Bruijn <will...@google.com>
Cc: Yuchung Cheng <ych...@google.com>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Neal Cardwell <ncardw...@google.com>
Cc: Martin KaFai Lau <ka...@fb.com>
---
 net/ipv4/tcp.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2e3807d..c207b16 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
return copied;
 
 do_error:
-   if (copied)
+   if (copied) {
+   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
goto out;
+   }
 out_err:
/* make sure we wake any epoll edge trigger waiter */
if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
@@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
 
 do_error:
-   if (copied + copied_syn)
+   if (copied + copied_syn) {
+   tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
goto out;
+   }
 out_err:
err = sk_stream_error(sk, flags, err);
/* make sure we wake any epoll edge trigger waiter */
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next] sock: reset sk_err for ICMP packets read from error queue

2016-11-30 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Only when ICMP packets are enqueued onto the error queue,
sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
in sock_dequeue_err_skb), a subsequent error queue read
would set sk_err to the next error on the queue, or 0 if empty.
As no error types other than ICMP set this field, sk_err should
not be modified upon dequeuing them.

Only for ICMP errors, reset the (racy) sk_err. Some applications,
like traceroute, rely on it and go into a futile busy POLLERR
loop otherwise.

In principle, sk_err has to be set while an ICMP error is queued.
Testing is_icmp_err_skb(skb_next) approximates this without
requiring a full queue walk. Applications that receive both ICMP
and other errors cannot rely on this legacy behavior, as other
errors do not set sk_err in the first place.

Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
---
 net/core/skbuff.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d1d1a5a..8dad391 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3714,20 +3714,29 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_queue_err_skb);
 
+static bool is_icmp_err_skb(const struct sk_buff *skb)
+{
+   return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
+  SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
+}
+
 struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
 {
struct sk_buff_head *q = &sk->sk_error_queue;
-   struct sk_buff *skb, *skb_next;
+   struct sk_buff *skb, *skb_next = NULL;
+   bool icmp_next = false;
unsigned long flags;
-   int err = 0;
 
spin_lock_irqsave(&q->lock, flags);
skb = __skb_dequeue(q);
if (skb && (skb_next = skb_peek(q)))
-   err = SKB_EXT_ERR(skb_next)->ee.ee_errno;
+   icmp_next = is_icmp_err_skb(skb_next);
spin_unlock_irqrestore(&q->lock, flags);
 
-   if (err)
+   if (is_icmp_err_skb(skb) && !icmp_next)
+   sk->sk_err = 0;
+
+   if (skb_next)
sk->sk_error_report(sk);
 
return skb;
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH net-next] tcp: remove unaligned accesses from tcp_get_info()

2016-11-09 Thread Soheil Hassas Yeganeh
On Wed, Nov 9, 2016 at 2:24 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
> From: Eric Dumazet <eduma...@google.com>
>
> After commit 6ed46d1247a5 ("sock_diag: align nlattr properly when
> needed"), tcp_get_info() gets 64bit aligned memory, so we can avoid
> the unaligned helpers.
>
> Suggested-by: David Miller <da...@davemloft.net>
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Soheil Hassas Yeganeh <soh...@google.com>

> ---
>  net/ipv4/tcp.c |   11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index a7d54cbcdabbbc0838c00e05074ef15560665f2d..f8f924ca662d5c84e438a6eacbe8f734ff0674ae 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -279,7 +279,6 @@
>
>  #include <asm/uaccess.h>
>  #include <asm/ioctls.h>
> -#include <asm/unaligned.h>
>  #include <net/busy_poll.h>
>
>  int sysctl_tcp_min_tso_segs __read_mostly = 2;
> @@ -2722,11 +2721,11 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
> /* Report meaningful fields for all TCP states, including listeners */
> rate = READ_ONCE(sk->sk_pacing_rate);
> rate64 = rate != ~0U ? rate : ~0ULL;
> -   put_unaligned(rate64, >tcpi_pacing_rate);
> +   info->tcpi_pacing_rate = rate64;
>
> rate = READ_ONCE(sk->sk_max_pacing_rate);
> rate64 = rate != ~0U ? rate : ~0ULL;
> -   put_unaligned(rate64, >tcpi_max_pacing_rate);
> +   info->tcpi_max_pacing_rate = rate64;
>
> info->tcpi_reordering = tp->reordering;
> info->tcpi_snd_cwnd = tp->snd_cwnd;
> @@ -2792,8 +2791,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
>
> slow = lock_sock_fast(sk);
>
> -   put_unaligned(tp->bytes_acked, >tcpi_bytes_acked);
> -   put_unaligned(tp->bytes_received, >tcpi_bytes_received);
> +   info->tcpi_bytes_acked = tp->bytes_acked;
> +   info->tcpi_bytes_received = tp->bytes_received;
> info->tcpi_notsent_bytes = max_t(int, 0, tp->write_seq - tp->snd_nxt);
>
> unlock_sock_fast(sk, slow);
> @@ -2811,7 +2810,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
> if (rate && intv) {
> rate64 = (u64)rate * tp->mss_cache * USEC_PER_SEC;
> do_div(rate64, intv);
> -   put_unaligned(rate64, >tcpi_delivery_rate);
> +   info->tcpi_delivery_rate = rate64;
> }
>  }
>  EXPORT_SYMBOL_GPL(tcp_get_info);
>
>

Very nice! Thanks.
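
The userspace view of tcp_get_info() is getsockopt(TCP_INFO), which
copies out the structure filled above. A minimal sketch (it uses
linux/tcp.h rather than netinet/tcp.h so the 64-bit fields touched by
this patch are present; the helper name is illustrative):

	#include <stdio.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <linux/tcp.h>

	static void dump_tcp_info(int fd)
	{
		struct tcp_info ti;
		socklen_t len = sizeof(ti);

		memset(&ti, 0, sizeof(ti));
		if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len))
			return;
		/* The u64 fields below are the ones tcp_get_info()
		 * previously had to store with put_unaligned().
		 */
		printf("rtt=%uus cwnd=%u pacing=%llu B/s acked=%llu B\n",
		       ti.tcpi_rtt, ti.tcpi_snd_cwnd,
		       (unsigned long long)ti.tcpi_pacing_rate,
		       (unsigned long long)ti.tcpi_bytes_acked);
	}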


[PATCH net] sock: fix sendmmsg for partial sendmsg

2016-11-04 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soh...@google.com>

Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of an individual sendmsg invocation is
positive. This corrupts the data stream for TCP, SCTP, and
UNIX stream sockets.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.

Fixes: 228e548e6020 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh <soh...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Neal Cardwell <ncardw...@google.com>
---
 net/socket.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/socket.c b/net/socket.c
index 5a9bf5e..272518b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2038,6 +2038,8 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
if (err)
break;
++datagrams;
+   if (msg_data_left(&msg_sys))
+   break;
cond_resched();
}
 
-- 
2.8.0.rc3.226.g39d4020
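
Even with this fix, sendmmsg() on a stream socket can return after a
short write of the last counted message, so callers still need to
check msg_len. A hedged caller-side sketch (illustrative helper, not
from the kernel tree; note it mutates the iovecs in place):

	#define _GNU_SOURCE
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Send all messages, resuming inside a partially-sent stream
	 * message instead of skipping to the next one (the bug above).
	 */
	static int send_all_mmsg(int fd, struct mmsghdr *msgs,
				 unsigned int vlen)
	{
		unsigned int done = 0;

		while (done < vlen) {
			int n = sendmmsg(fd, msgs + done, vlen - done, 0);
			struct mmsghdr *last;
			size_t i, want = 0, sent;

			if (n <= 0)
				return -1;
			last = &msgs[done + n - 1];
			for (i = 0; i < last->msg_hdr.msg_iovlen; i++)
				want += last->msg_hdr.msg_iov[i].iov_len;
			if (last->msg_len >= want) {
				done += n;	/* all fully sent */
				continue;
			}
			/* Short write: advance the iovec, retry message. */
			sent = last->msg_len;
			while (sent >= last->msg_hdr.msg_iov->iov_len) {
				sent -= last->msg_hdr.msg_iov->iov_len;
				last->msg_hdr.msg_iov++;
				last->msg_hdr.msg_iovlen--;
			}
			last->msg_hdr.msg_iov->iov_base =
				(char *)last->msg_hdr.msg_iov->iov_base + sent;
			last->msg_hdr.msg_iov->iov_len -= sent;
			done += n - 1;
		}
		return 0;
	}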


