[PATCH 3.8 06/91] tcp: TSO packets automatic sizing

2013-11-07 Thread Kamal Mostafa
3.8.13.13 -stable review patch.  If anyone has any objections, please let me 
know.

--

From: Eric Dumazet 

commit 95bd09eb27507691520d39ee1044d6ad831c1168 upstream.
commit 02cf4ebd82ff0ac7254b88e466820a290ed8289a upstream.
commit 7eec4174ff29cd42f2acfae8112f51c228545d40 upstream.

After hearing many people over the past years complain about TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and the general consensus is to reduce the
amount of buffering.

This patch introduces a per-socket field, sk_pacing_rate, which
approximates the current sending rate and allows us to size TSO packets
so that we try to send one packet every ms.

This field could be set by other transports.

This patch has no impact on high-speed flows, where large TSO packets
make sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimum number of segments per TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp-up.

sk_pacing_rate = 2 * cwnd * mss / srtt

v2: Neal Cardwell reported a suspect deferral of the last two segments on
an initial write of 10 MSS; I had to change tcp_tso_should_defer() to
take tp->xmit_size_goal_segs into account.

Signed-off-by: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Yuchung Cheng 
Cc: Van Jacobson 
Cc: Tom Herbert 
Acked-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
[ kamal: backport to 3.8 (context) ]
Signed-off-by: Kamal Mostafa 
---
 Documentation/networking/ip-sysctl.txt |  9 +
 include/net/sock.h |  2 ++
 include/net/tcp.h  |  1 +
 net/core/sock.c|  1 +
 net/ipv4/sysctl_net_ipv4.c | 10 ++
 net/ipv4/tcp.c | 28 ++-
 net/ipv4/tcp_input.c   | 35 +-
 net/ipv4/tcp_output.c  |  2 +-
 8 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index dbca661..62b9a61 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -510,6 +510,15 @@ tcp_syn_retries - INTEGER
 tcp_timestamps - BOOLEAN
Enable timestamps as defined in RFC1323.
 
+tcp_min_tso_segs - INTEGER
+   Minimal number of segments per TSO frame.
+   Since linux-3.12, TCP does an automatic sizing of TSO frames,
+   depending on flow rate, instead of filling 64Kbytes packets.
+   For specific usages, it's possible to force TCP to build big
+   TSO frames. Note that TCP stack might split too big TSO packets
+   if available window is too small.
+   Default: 2
+
 tcp_tso_win_divisor - INTEGER
This allows control over what percentage of the congestion window
can be consumed by a single TSO frame.
diff --git a/include/net/sock.h b/include/net/sock.h
index 873abca..94871cc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -228,6 +228,7 @@ struct cg_proto;
   *@sk_wmem_queued: persistent queue size
   *@sk_forward_alloc: space allocated forward
   *@sk_allocation: allocation mode
+  *@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *@sk_sndbuf: size of send buffer in bytes
   *@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -352,6 +353,7 @@ struct sock {
kmemcheck_bitfield_end(flags);
int sk_wmem_queued;
gfp_t   sk_allocation;
+   u32 sk_pacing_rate; /* bytes per second */
netdev_features_t   sk_route_caps;
netdev_features_t   sk_route_nocaps;
int sk_gso_type;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4da2167..45f3368 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,6 +292,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_min_tso_segs;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/core/sock.c b/net/core/sock.c
index b8af814..fc0d751 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ 
