Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-29 Thread Daniel Metz

                   B     D     R     Y
--------------------------------------
mean TCPRecovLat  3s   -7%  +39%  +38%
mean TCPRecovLat252s   +1%  -11%  -11%


This is indeed very interesting and somewhat unexpected. Do you have any 
clue why Y is as bad as R and so much worse than B? By my understanding 
I would have expected Y to be similar to B. At least tests on the Mean 
Response Waiting Time of sender limited flows show hardly any difference 
to B (as expected).


Also, is a potentially longer TCPRecovLat such a bad thing, 
considering your information on HTTP response performance?


Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-15 Thread Daniel Metz

Yuchung Cheng | 2016-06-15 20:02:
> Let me explain in a different way:
>
> * RFC6298 applies a lower bound of 1 second to RTO (section 2.4)
>
> * Linux currently applies a lower bound of 200ms (min_rto) to
> K*RTTVAR, but /not/ RTO itself.
>
> * This patch applies the lower bound of 200ms to RTO, similar to RFC6298
>
>
> Let's say the SRTT is 100ms and the RTT variation is 10ms. The variation
> is low because we've been sending large chunks, and RTT is fairly
> stable, and we sample on every ACK. The RTOs produced are
>
> RFC6298: RTO=1s
> Linux: RTO=300ms
> This patch: RTO=200ms
>
> Then we send 1 packet out. The receiver delays the ACK up to 200ms.
> The actual RTT can be longer because other network components further
> delay the data or the ACK. This patch would surely fire the RTO
> spuriously.
>
> so we can either implement RFC6298 faithfully, or apply the
> lower-bound as-is, or something in between. But the current patch
> as-is is more aggressive. Did I miss something?

Thank you for the clarification. The fundamental idea of this patch 
was to reduce Linux's RTO overestimation. That also meant not 
clinging to the RFC 6298 MinRTO of 1 second ((2.4) "[...] at the same 
time acknowledging that at some future point, research may show that a 
smaller minimum RTO is acceptable or superior."). A more aggressive RTO 
will of course increase the number of spurious retransmissions. The 
question is whether the benefit outweighs the cost. The tests we 
conducted have not shown a significant negative impact so far, and for 
sender-limited TCP flows the results were promising.


Daniel
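
The three lower-bound policies from the quoted example (SRTT 100ms,
RTT variation 10ms) can be compared side by side. The following is a
minimal C sketch, not code from the patch: the function names and the
use of plain millisecond values are assumptions made for illustration,
and the formulas follow the description in the quoted mail.

```c
#include <assert.h>

/* All values are in milliseconds. Names are illustrative only. */

/* RFC 6298 (2.4): the whole RTO is rounded up to 1 second. */
unsigned int rto_rfc6298_ms(unsigned int srtt, unsigned int rttvar)
{
	unsigned int rto = srtt + 4 * rttvar;

	return rto < 1000 ? 1000 : rto;
}

/* Linux as described above: min_rto bounds only the K*RTTVAR term. */
unsigned int rto_linux_ms(unsigned int srtt, unsigned int rttvar,
			  unsigned int min_rto)
{
	unsigned int var = 4 * rttvar;

	if (var < min_rto)
		var = min_rto;
	return srtt + var;
}

/* The patch as described above: min_rto bounds the RTO itself. */
unsigned int rto_patch_ms(unsigned int srtt, unsigned int rttvar,
			  unsigned int min_rto)
{
	unsigned int rto = srtt + 4 * rttvar;

	return rto < min_rto ? min_rto : rto;
}

/* Reproduces the numbers from the quoted example. */
void rto_example(void)
{
	assert(rto_rfc6298_ms(100, 10) == 1000);    /* RFC6298: RTO=1s */
	assert(rto_linux_ms(100, 10, 200) == 300);  /* Linux: RTO=300ms */
	assert(rto_patch_ms(100, 10, 200) == 200);  /* Patch: RTO=200ms */
}
```

With a delayed ACK of up to 200ms on top of a 100ms SRTT, only the
third variant leaves no headroom, which is the concern raised in the
quoted mail.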


[PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-15 Thread Daniel Metz
From: Daniel Metz <daniel.m...@rohde-schwarz.com>

This patch makes the Linux RTO calculation compliant with the RFC 6298
standard. MinRTO is no longer added to the computed RTO, which reduces
RTO damping and overestimation.

In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
the calculated RTO is rounded up to the Minimum RTO (MinRTO) if it is
smaller. The Linux implementation deviates from the standard in that
it effectively adds the defined MinRTO to the calculated RTO. When
comparing both approaches, the Linux calculation seems to perform
worse for sender-limited TCP flows like Telnet, SSH or constant bit
rate encoded transmissions, especially for Round Trip Times (RTT) of
50ms to 800ms.

Compared to the Linux implementation, the RTO calculation proposed in
RFC 6298 adapts better and more precisely to current network
characteristics. Extensive measurements for bulk data did not show a
negative impact of the adjusted calculation.
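
For readers unfamiliar with the RFC 6298 estimator, the update rules
(alpha = 1/8, beta = 1/4, K = 4) can be sketched as below. This is a
simplified illustration, not the code of this patch: the struct and
function names are hypothetical, values are kept unscaled in
microseconds, and the clock granularity term G of the RFC is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, unscaled estimator state (microseconds). */
struct rtt_est {
	uint32_t srtt_us;	/* smoothed RTT */
	uint32_t rttvar_us;	/* RTT variation */
	int has_sample;		/* 0 until the first measurement */
};

/* Feed one RTT measurement r_us into the estimator. */
void rtt_est_update(struct rtt_est *e, uint32_t r_us)
{
	uint32_t err;

	assert(e);
	if (!e->has_sample) {
		/* First measurement: SRTT = R, RTTVAR = R/2 */
		e->srtt_us = r_us;
		e->rttvar_us = r_us / 2;
		e->has_sample = 1;
		return;
	}
	/* RTTVAR is updated first, using the old SRTT. */
	err = e->srtt_us > r_us ? e->srtt_us - r_us : r_us - e->srtt_us;
	e->rttvar_us = e->rttvar_us - e->rttvar_us / 4 + err / 4;
	e->srtt_us = e->srtt_us - e->srtt_us / 8 + r_us / 8;
}

/* RTO = SRTT + 4*RTTVAR, rounded up to min_rto_us if it is smaller. */
uint32_t rtt_est_rto_us(const struct rtt_est *e, uint32_t min_rto_us)
{
	uint32_t rto;

	assert(e);
	rto = e->srtt_us + 4 * e->rttvar_us;
	return rto < min_rto_us ? min_rto_us : rto;
}
```

The point of the sketch is the last function: the minimum is applied
to the finished RTO rather than being added on top of SRTT.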

Example performance comparison for sender-limited flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Number of data segments: 300
- Mean of 11 runs

 Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1
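
To put the table into relative terms, the reduction per column can be
computed as (old - new) / old. The helper below is only an
illustrative sketch; the function and array names are made up here,
and the numbers are copied from the table, in column order.

```c
#include <assert.h>

#define N_COLS 8

/* Mean response waiting times from the table, in milliseconds. */
const double old_ms[N_COLS] = { 206.4, 208.6, 218.0, 218.6,
				227.2, 249.3, 274.7, 308.2 };
const double new_ms[N_COLS] = { 203.9, 206.0, 207.0, 209.9,
				217.3, 225.6, 238.7, 259.1 };

/* Relative reduction of the waiting time, in percent. */
double reduction_pct(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return 100.0 * (before_ms - after_ms) / before_ms;
}

/* Reduction for one table column (0 = lowest PER, 7 = highest PER). */
double reduction_at(int col)
{
	assert(col >= 0 && col < N_COLS);
	return reduction_pct(old_ms[col], new_ms[col]);
}
```

For the first column this gives roughly 1.2%, for the last column
roughly 15.9%, so the gain grows with the packet error rate.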

Detailed analysis:
https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/

Reasoning for historic design:
Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP.
Conference Paper. Proceedings of the FREENIX Track. 2002 USENIX Annual
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf

Signed-off-by: Hagen Paul Pfeifer <ha...@jauu.net>
Signed-off-by: Daniel Metz <dm...@mytum.de>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Yuchung Cheng <ych...@google.com>
---

 v3:
  - remove mdev_us

 v2:
  - Using the RFC 6298 compliant implementation, the tcp_sock struct variable
u32 mdev_max_us becomes obsolete and consequently is being removed.
  - Add reference to Kuznetsov paper


 include/linux/tcp.h  |  2 --
 net/ipv4/tcp.c   |  3 +-
 net/ipv4/tcp_input.c | 72 ++--
 net/ipv4/tcp_metrics.c   |  5 ++--
 net/ipv4/tcp_minisocks.c |  1 -
 5 files changed, 18 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..3128eb1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -230,8 +230,6 @@ struct tcp_sock {
 
 /* RTT measurement */
u32 srtt_us;/* smoothed round trip time << 3 in usecs */
-   u32 mdev_us;/* medium deviation */
-   u32 mdev_max_us;/* maximal mdev for the last rtt period */
u32 rttvar_us;  /* smoothed mdev_max*/
u32 rtt_seq;/* sequence number to update rttvar */
struct rtt_meas {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed14..4a7597c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -386,7 +386,6 @@ void tcp_init_sock(struct sock *sk)
INIT_LIST_HEAD(&tp->tsq_node);
 
icsk->icsk_rto = TCP_TIMEOUT_INIT;
-   tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
tp->rtt_min[0].rtt = ~0U;
 
/* So many TCP implementations out there (incorrectly) count the
@@ -2703,7 +2702,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_pmtu = icsk->icsk_pmtu_cookie;
info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
info->tcpi_rtt = tp->srtt_us >> 3;
-   info->tcpi_rttvar = tp->mdev_us >> 2;
+   info->tcpi_rttvar = tp->rttvar_us >> 2;
info->tcpi_snd_ssthresh = tp->snd_ssthresh;
info->tcpi_snd_cwnd = tp->snd_cwnd;
info->tcpi_advmss = tp->advmss;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..279f5f7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,19 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
long m = mrtt_us; /* RTT */
u32 srtt = tp->srtt_us;
 
-   /*  The following amusing 

[PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-14 Thread Daniel Metz
From: Daniel Metz <daniel.m...@rohde-schwarz.com>

This patch makes the Linux RTO calculation compliant with the RFC 6298
standard. MinRTO is no longer added to the computed RTO, which reduces
RTO damping and overestimation.

In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
the calculated RTO is rounded up to the Minimum RTO (MinRTO) if it is
smaller. The Linux implementation deviates from the standard in that
it effectively adds the defined MinRTO to the calculated RTO. When
comparing both approaches, the Linux calculation seems to perform
worse for sender-limited TCP flows like Telnet, SSH or constant bit
rate encoded transmissions, especially for Round Trip Times (RTT) of
50ms to 800ms.

Compared to the Linux implementation, the RTO calculation proposed in
RFC 6298 adapts better and more precisely to current network
characteristics. Extensive measurements for bulk data did not show a
negative impact of the adjusted calculation.

Example performance comparison for sender-limited flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Number of data segments: 300
- Mean of 11 runs

 Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1


Detailed analysis:
https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/

Reasoning for historic design:
Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP.
Conference Paper. Proceedings of the FREENIX Track. 2002 USENIX Annual
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf


Signed-off-by: Hagen Paul Pfeifer <ha...@jauu.net>
Signed-off-by: Daniel Metz <dm...@mytum.de>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Yuchung Cheng <ych...@google.com>
---

v2:
 - Using the RFC 6298 compliant implementation, the tcp_sock struct variable
 u32 mdev_max_us becomes obsolete and consequently is being removed.
 - Add reference to Kuznetsov paper


 include/linux/tcp.h|  1 -
 net/ipv4/tcp_input.c   | 74 --
 net/ipv4/tcp_metrics.c |  2 +-
 3 files changed, 18 insertions(+), 59 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..d1790c5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -231,7 +231,6 @@ struct tcp_sock {
 /* RTT measurement */
u32 srtt_us;/* smoothed round trip time << 3 in usecs */
u32 mdev_us;/* medium deviation */
-   u32 mdev_max_us;/* maximal mdev for the last rtt period */
u32 rttvar_us;  /* smoothed mdev_max*/
u32 rtt_seq;/* sequence number to update rttvar */
struct rtt_meas {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..0d53537 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
long m = mrtt_us; /* RTT */
u32 srtt = tp->srtt_us;
 
-   /*  The following amusing code comes from Jacobson's
-*  article in SIGCOMM '88.  Note that rtt and mdev
-*  are scaled versions of rtt and mean deviation.
-*  This is designed to be as fast as possible
-*  m stands for "measurement".
-*
-*  On a 1990 paper the rto value is changed to:
-*  RTO = rtt + 4 * mdev
-*
-* Funny. This algorithm seems to be very broken.
-* These formulae increase RTO, when it should be decreased, increase
-* too slowly, when it should be increased quickly, decrease too quickly
-* etc. I guess in BSD RTO takes ONE value, so that it is absolutely
-* does not matter how to _calculate_ it. Seems, it was trap
-* that VJ failed to avoid. 8)
-*/
if (srtt != 0) {
-   m -= (srtt >> 3);   /* m is now error in rtt est */
-   srtt += m;  /* rtt = 7/8 rtt + 1/8 new */
-   if (m < 0) {
-   m = -m; 

[PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-13 Thread Daniel Metz
This patch makes the Linux RTO calculation compliant with the RFC 6298
standard. MinRTO is no longer added to the computed RTO, which reduces
RTO damping and overestimation.

In the RFC 6298 standard TCP Retransmission Timeout (RTO) calculation,
the calculated RTO is rounded up to the Minimum RTO (MinRTO) if it is
smaller. The Linux implementation deviates from the standard in that
it effectively adds the defined MinRTO to the calculated RTO. When
comparing both approaches, the Linux calculation seems to perform
worse for sender-limited TCP flows like Telnet, SSH or constant bit
rate encoded transmissions, especially for Round Trip Times (RTT) of
50ms to 800ms.

Compared to the Linux implementation, the RTO calculation proposed in
RFC 6298 adapts better and more precisely to current network
characteristics. Extensive measurements for bulk data did not show a
negative impact of the adjusted calculation.

Example performance comparison for sender-limited flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Number of data segments: 300
- Mean of 11 runs

 Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1


Detailed Analysis:

https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/


Signed-off-by: Hagen Paul Pfeifer <ha...@jauu.net>
Signed-off-by: Daniel Metz <dm...@mytum.de>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Yuchung Cheng <ych...@google.com>
Cc: Neal Cardwell <ncardw...@google.com>
---

We removed outdated comments in the code, though more cleanup may be required.


 net/ipv4/tcp_input.c | 74 
 1 file changed, 17 insertions(+), 57 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d6c8f4cd0..a0f66f8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
long m = mrtt_us; /* RTT */
u32 srtt = tp->srtt_us;
 
-   /*  The following amusing code comes from Jacobson's
-*  article in SIGCOMM '88.  Note that rtt and mdev
-*  are scaled versions of rtt and mean deviation.
-*  This is designed to be as fast as possible
-*  m stands for "measurement".
-*
-*  On a 1990 paper the rto value is changed to:
-*  RTO = rtt + 4 * mdev
-*
-* Funny. This algorithm seems to be very broken.
-* These formulae increase RTO, when it should be decreased, increase
-* too slowly, when it should be increased quickly, decrease too quickly
-* etc. I guess in BSD RTO takes ONE value, so that it is absolutely
-* does not matter how to _calculate_ it. Seems, it was trap
-* that VJ failed to avoid. 8)
-*/
if (srtt != 0) {
-   m -= (srtt >> 3);   /* m is now error in rtt est */
-   srtt += m;  /* rtt = 7/8 rtt + 1/8 new */
-   if (m < 0) {
-   m = -m; /* m is now abs(error) */
-   m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-   /* This is similar to one of Eifel findings.
-* Eifel blocks mdev updates when rtt decreases.
-* This solution is a bit different: we use finer gain
-* for mdev in this case (alpha*beta).
-* Like Eifel it also prevents growth of rto,
-* but also it limits too fast rto decreases,
-* happening in pure Eifel.
-*/
-   if (m > 0)
-   m >>= 3;
-   } else {
-   m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-   }
-   tp->mdev_us += m;   /* mdev = 3/4 mdev + 1/4 new */
-   if (tp->mdev_us > tp->mdev_max_us) {
-   tp->mdev_max_us = tp->mdev_us