Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
>                      B     D     R     Y
> --
> mean TCPRecovLat    3s   -7%  +39%  +38%
> mean TCPRecovLat2  52s   +1%  -11%  -11%

This is indeed very interesting and somewhat unexpected. Do you have any clue why Y is as bad as R and so much worse than B? From my understanding, I would have expected Y to be similar to B. At least, tests on the Mean Response Waiting Time of sender-limited flows show hardly any difference from B (as expected). Also, is a potentially longer time in TCPRecovLat such a bad thing, considering your information on HTTP response performance?
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
On Tue, Jun 21, 2016 at 10:53 PM, Yuchung Cheng wrote:
>
> On Fri, Jun 17, 2016 at 11:56 AM, Yuchung Cheng wrote:
> >
> > On Fri, Jun 17, 2016 at 11:32 AM, David Miller wrote:
> > >
> > > From: Daniel Metz
> > > Date: Wed, 15 Jun 2016 20:00:03 +0200
> > >
> > > > This patch adjusts Linux RTO calculation to be RFC6298 Standard
> > > > compliant. MinRTO is no longer added to the computed RTO, RTO damping
> > > > and overestimation are decreased.
> > > ...
> > >
> > > Yuchung, I assume I am waiting for you to do the testing you said
> > > you would do for this patch, right?
> >
> > Yes, I spent the last two days resolving some unrelated glitches to
> > start my testing on Web servers. I should be able to get some results
> > over the weekend.
> >
> > I will test
> > 0) current Linux
> > 1) this patch
> > 2) RFC6298 with min_RTO=1sec
> > 3) RFC6298 with a minimum RTTVAR of 200ms (so it is more like the current
> > Linux style of min RTO, which only applies to RTTVAR)
> >
> > and collect the TCP latency (how long to send an HTTP response) and
> > (spurious) timeout & retransmission stats.
>
> Thanks for the patience. I've collected data from some Google Web
> servers. They serve a mix of US and SouthAm users using
> HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> gmail, etc., but not YouTube videos). The mean RTT is about 100ms.
>
> The user connections were split into 4 groups with different TCP RTO
> configs. Each group has many millions of connections, but the
> size variation among groups is well under 1%.
>
> B: baseline Linux
> D: this patch
> R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
>
> For mean TCP latency of HTTP responses (first byte sent to last byte
> acked), B < R < Y < D. But the differences are insignificant (<1%).
> The median, 95pctl, and 99pctl show similar indifference. In summary,
> there's hardly any visible impact on latency.
> I also looked only at responses
> less than 4KB but did not see a different picture.
>
> The main difference is the retransmission rate, where R =~ Y < B =~ D.
> R and Y are ~20% lower than B and D. Parsing the SNMP stats reveals
> more interesting details. The table shows the deltas in percentage
> relative to the baseline B.
>
>                 D     R     Y
> --
> Timeout       +12%  -16%  -16%
> TailLossProb  +28%   -7%   -7%
> DSACK_rcvd    +37%   -7%   -7%
> Cwnd-undo     +16%  -29%  -29%
>
> The RTO change affects TLP because TLP will use the min of the RTO and the TLP
> timer value to arm the probe timer.
>
> The stats indicate that the main culprit of spurious timeouts / rtx is
> the RTO lower bound. But they also show the RFC RTTVAR averaging is as
> good as the current Linux approach.
>
> Given that, I would recommend we revise this patch to use the RFC
> averaging but keep the existing lower bound (of RTTVAR at 200ms). We can
> experiment further with the lower bound and change that in a separate
> patch.

Hi, I have some updates. I instrumented the kernel to capture the time spent in recovery (attached). The latency measurement starts when TCP goes into recovery, triggered by either ACKs or RTOs. The start time is the (original) sent time of the first unacked packet. The end time is when the ACK covers the highest sequence sent when recovery started. The total latency in usec and the count are recorded in MIB_TCPRECOVLAT and MIB_TCPRECOVCNT.

If the connection times out or closes while the sender is still in recovery, the total latency and count are stored in MIB_TCPRECOVLAT2 and MIB_TCPRECOVCNT2 instead. This second bucket captures long recoveries that led to eventual connection aborts.

Since network stats usually follow a power-law distribution, the mean of such a distribution is going to be dominated by the tail, but the new metrics still show a very interesting impact of the different RTOs. Using the same table format as in my previous email, this table shows the difference in percentage relative to the baseline.
                     B     D     R     Y
  --
  mean TCPRecovLat    3s   -7%  +39%  +38%
  mean TCPRecovLat2  52s   +1%  -11%  -11%

The new metrics show that lower-bounding the RTO at 200ms (D) indeed lowers the latency. But per my previous analysis, D has a lot more spurious rtx and TLPs (whose collateral damage on latency is not captured by these metrics). And note that the TLP timer uses the min of the RTO and the TLP timeout, so TLP fires 28% more often in (D). Therefore the latency benefit may come mainly from a faster TLP timer. Nevertheless, the significant impact on recovery latency does not show up in the response latency we measured earlier. My conjecture is that only a small fraction of flows experiences losses, so even a 40% average increase in loss recovery does not move the needle, or the latency
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
On Wed, Jun 22, 2016 at 4:21 AM, Hagen Paul Pfeifer wrote:
>
> > On June 22, 2016 at 7:53 AM Yuchung Cheng wrote:
> >
> > Thanks for the patience. I've collected data from some Google Web
> > servers. They serve a mix of US and SouthAm users using
> > HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> > gmail, etc., but not YouTube videos). The mean RTT is about 100ms.
> >
> > The user connections were split into 4 groups with different TCP RTO
> > configs. Each group has many millions of connections, but the
> > size variation among groups is well under 1%.
> >
> > B: baseline Linux
> > D: this patch
> > R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> > Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like
> > B)
> >
> > For mean TCP latency of HTTP responses (first byte sent to last byte
> > acked), B < R < Y < D. But the differences are insignificant (<1%).
> > The median, 95pctl, and 99pctl show similar indifference. In summary,
> > there's hardly any visible impact on latency. I also looked only at responses
> > less than 4KB but did not see a different picture.
> >
> > The main difference is the retransmission rate, where R =~ Y < B =~ D.
> > R and Y are ~20% lower than B and D. Parsing the SNMP stats reveals
> > more interesting details. The table shows the deltas in percentage
> > relative to the baseline B.
> >
> >                 D     R     Y
> > --
> > Timeout       +12%  -16%  -16%
> > TailLossProb  +28%   -7%   -7%
> > DSACK_rcvd    +37%   -7%   -7%
> > Cwnd-undo     +16%  -29%  -29%
> >
> > The RTO change affects TLP because TLP will use the min of the RTO and the TLP
> > timer value to arm the probe timer.
> >
> > The stats indicate that the main culprit of spurious timeouts / rtx is
> > the RTO lower bound. But they also show the RFC RTTVAR averaging is as
> > good as the current Linux approach.
> >
> > Given that, I would recommend we revise this patch to use the RFC
> > averaging but keep the existing lower bound (of RTTVAR at 200ms).
> > We can
> > experiment further with the lower bound and change that in a separate
> > patch.
>
> Great news Yuchung!
>
> Then Daniel will prepare v4 with a min-rto lower bound:
>
>   max(RTTVAR, tcp_rto_min_us(struct sock))
>
> Any further suggestions Yuchung, Eric? We will also feed this v4 into our test
> environment to check the behavior for sender-limited, non-continuous flows.

Yes, a small one: I think the patch should change __tcp_set_rto() instead of
tcp_set_rto() so it applies to recurring timeouts as well.

>
> Hagen
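Yuchung's point about __tcp_set_rto() matters because the base RTO computed there is what the retransmit timer doubles on each recurring timeout; if only tcp_set_rto() changed, backed-off timers would still see the old formula. A rough user-space sketch of that doubling (the cap value is an illustrative stand-in for the kernel's TCP_RTO_MAX, not the real definition):

```c
#include <assert.h>

/* Illustrative cap, standing in for the kernel's TCP_RTO_MAX. */
#define RTO_MAX_US 120000000UL	/* 120 s */

/* On each recurring timeout the pending RTO is doubled (exponential
 * backoff), capped at RTO_MAX_US. The base value fed in here is the
 * one a __tcp_set_rto()-style helper would recompute. */
static unsigned long backed_off_rto(unsigned long base_rto_us,
				    unsigned int retries)
{
	unsigned long rto = base_rto_us;
	unsigned int i;

	for (i = 0; i < retries; i++) {
		if (rto > RTO_MAX_US / 2) {
			rto = RTO_MAX_US;
			break;
		}
		rto *= 2;
	}
	return rto;
}
```

With a 300 ms base RTO, three consecutive timeouts would arm the fourth timer at 2.4 s under this model.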
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
> On June 22, 2016 at 7:53 AM Yuchung Cheng wrote:
>
> Thanks for the patience. I've collected data from some Google Web
> servers. They serve a mix of US and SouthAm users using
> HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> gmail, etc., but not YouTube videos). The mean RTT is about 100ms.
>
> The user connections were split into 4 groups with different TCP RTO
> configs. Each group has many millions of connections, but the
> size variation among groups is well under 1%.
>
> B: baseline Linux
> D: this patch
> R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
>
> For mean TCP latency of HTTP responses (first byte sent to last byte
> acked), B < R < Y < D. But the differences are insignificant (<1%).
> The median, 95pctl, and 99pctl show similar indifference. In summary,
> there's hardly any visible impact on latency. I also looked only at responses
> less than 4KB but did not see a different picture.
>
> The main difference is the retransmission rate, where R =~ Y < B =~ D.
> R and Y are ~20% lower than B and D. Parsing the SNMP stats reveals
> more interesting details. The table shows the deltas in percentage
> relative to the baseline B.
>
>                 D     R     Y
> --
> Timeout       +12%  -16%  -16%
> TailLossProb  +28%   -7%   -7%
> DSACK_rcvd    +37%   -7%   -7%
> Cwnd-undo     +16%  -29%  -29%
>
> The RTO change affects TLP because TLP will use the min of the RTO and the TLP
> timer value to arm the probe timer.
>
> The stats indicate that the main culprit of spurious timeouts / rtx is
> the RTO lower bound. But they also show the RFC RTTVAR averaging is as
> good as the current Linux approach.
>
> Given that, I would recommend we revise this patch to use the RFC
> averaging but keep the existing lower bound (of RTTVAR at 200ms). We can
> experiment further with the lower bound and change that in a separate
> patch.

Great news Yuchung!
Then Daniel will prepare v4 with a min-rto lower bound:

  max(RTTVAR, tcp_rto_min_us(struct sock))

Any further suggestions Yuchung, Eric? We will also feed this v4 into our test environment to check the behavior for sender-limited, non-continuous flows.

Hagen
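The bound Hagen proposes can be sketched in user space as follows. The 200 ms constant and the plain-microsecond arithmetic are illustrative assumptions (the kernel derives the minimum per socket via tcp_rto_min_us() and keeps srtt_us/rttvar_us in fixed-point form):

```c
#include <assert.h>

/* Illustrative stand-ins; the kernel derives these per socket/route. */
#define RTO_MIN_US 200000UL	/* 200 ms, the current Linux lower bound */
#define RTO_MAX_US 120000000UL

/* RFC 6298 RTO with the proposed floor applied to RTTVAR only:
 * RTO = SRTT + max(RTTVAR, rto_min), clamped to the maximum RTO. */
static unsigned long rto_with_rttvar_floor(unsigned long srtt_us,
					   unsigned long rttvar_us)
{
	unsigned long var = rttvar_us > RTO_MIN_US ? rttvar_us : RTO_MIN_US;
	unsigned long rto = srtt_us + var;

	return rto > RTO_MAX_US ? RTO_MAX_US : rto;
}
```

For a 100 ms SRTT with a small 50 ms RTTVAR, the floor dominates and yields a 300 ms RTO; once RTTVAR exceeds 200 ms, the measured variance takes over.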
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
On Fri, Jun 17, 2016 at 11:56 AM, Yuchung Cheng wrote:
>
> On Fri, Jun 17, 2016 at 11:32 AM, David Miller wrote:
> >
> > From: Daniel Metz
> > Date: Wed, 15 Jun 2016 20:00:03 +0200
> >
> > > This patch adjusts Linux RTO calculation to be RFC6298 Standard
> > > compliant. MinRTO is no longer added to the computed RTO, RTO damping
> > > and overestimation are decreased.
> > ...
> >
> > Yuchung, I assume I am waiting for you to do the testing you said
> > you would do for this patch, right?
>
> Yes, I spent the last two days resolving some unrelated glitches to
> start my testing on Web servers. I should be able to get some results
> over the weekend.
>
> I will test
> 0) current Linux
> 1) this patch
> 2) RFC6298 with min_RTO=1sec
> 3) RFC6298 with a minimum RTTVAR of 200ms (so it is more like the current
> Linux style of min RTO, which only applies to RTTVAR)
>
> and collect the TCP latency (how long to send an HTTP response) and
> (spurious) timeout & retransmission stats.

Thanks for the patience. I've collected data from some Google Web servers. They serve a mix of US and SouthAm users using HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps, gmail, etc., but not YouTube videos). The mean RTT is about 100ms.

The user connections were split into 4 groups with different TCP RTO configs. Each group has many millions of connections, but the size variation among groups is well under 1%.

B: baseline Linux
D: this patch
R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)

For mean TCP latency of HTTP responses (first byte sent to last byte acked), B < R < Y < D. But the differences are insignificant (<1%). The median, 95pctl, and 99pctl show similar indifference. In summary, there's hardly any visible impact on latency. I also looked only at responses less than 4KB but did not see a different picture.

The main difference is the retransmission rate, where R =~ Y < B =~ D.
R and Y are ~20% lower than B and D. Parsing the SNMP stats reveals more interesting details. The table shows the deltas in percentage relative to the baseline B.

               D     R     Y
--
Timeout      +12%  -16%  -16%
TailLossProb +28%   -7%   -7%
DSACK_rcvd   +37%   -7%   -7%
Cwnd-undo    +16%  -29%  -29%

The RTO change affects TLP because TLP will use the min of the RTO and the TLP timer value to arm the probe timer.

The stats indicate that the main culprit of spurious timeouts / rtx is the RTO lower bound. But they also show the RFC RTTVAR averaging is as good as the current Linux approach.

Given that, I would recommend we revise this patch to use the RFC averaging but keep the existing lower bound (of RTTVAR at 200ms). We can experiment further with the lower bound and change that in a separate patch.
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
On Fri, Jun 17, 2016 at 11:32 AM, David Miller wrote:
>
> From: Daniel Metz
> Date: Wed, 15 Jun 2016 20:00:03 +0200
>
> > This patch adjusts Linux RTO calculation to be RFC6298 Standard
> > compliant. MinRTO is no longer added to the computed RTO, RTO damping
> > and overestimation are decreased.
> ...
>
> Yuchung, I assume I am waiting for you to do the testing you said
> you would do for this patch, right?

Yes, I spent the last two days resolving some unrelated glitches to start my testing on Web servers. I should be able to get some results over the weekend.

I will test
0) current Linux
1) this patch
2) RFC6298 with min_RTO=1sec
3) RFC6298 with a minimum RTTVAR of 200ms (so it is more like the current Linux style of min RTO, which only applies to RTTVAR)

and collect the TCP latency (how long to send an HTTP response) and (spurious) timeout & retransmission stats. I didn't respond to Hagen's email yet b/c I thought data would help the discussion better :-)
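The four configurations under test reduce to different ways of deriving an RTO from SRTT and RTTVAR. A rough user-space sketch of those formulas (plain microseconds, illustrative 200 ms constant, no kernel fixed-point scaling; the difference in RTTVAR *averaging* between baseline Linux and the RFC variants is not modeled here):

```c
#include <assert.h>

#define MS 1000UL	/* usec per msec */

/* The four test groups, reduced to how each turns SRTT/RTTVAR into an
 * RTO (values in usec; a simplification of the actual kernel code). */
enum rto_cfg { CFG_B, CFG_D, CFG_R, CFG_Y };

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long rto_us(enum rto_cfg c, unsigned long srtt,
			    unsigned long rttvar)
{
	switch (c) {
	case CFG_B:	/* baseline Linux: min RTO applied to RTTVAR only */
	case CFG_Y:	/* RFC averaging, same 200 ms floor on RTTVAR */
		return srtt + max_ul(rttvar, 200 * MS);
	case CFG_D:	/* this patch: round the whole RTO up to MinRTO */
		return max_ul(srtt + rttvar, 200 * MS);
	case CFG_R:	/* RFC averaging, RTO lower-bounded at 1 s per RFC 6298 */
		return max_ul(srtt + rttvar, 1000 * MS);
	}
	return 0;
}
```

Under this model, a flow with SRTT = 100 ms and RTTVAR = 50 ms would arm a 300 ms timer in B and Y, a 200 ms timer in D, and a 1 s timer in R, which illustrates why D retransmits more aggressively and R less.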
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
From: Daniel Metz
Date: Wed, 15 Jun 2016 20:00:03 +0200

> This patch adjusts Linux RTO calculation to be RFC6298 Standard
> compliant. MinRTO is no longer added to the computed RTO, RTO damping
> and overestimation are decreased.
...

Yuchung, I assume I am waiting for you to do the testing you said you would do for this patch, right?
[PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
From: Daniel Metz

This patch adjusts the Linux RTO calculation to be RFC 6298 Standard compliant. MinRTO is no longer added to the computed RTO; RTO damping and overestimation are decreased.

In the RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation, the calculated RTO is rounded up to the Minimum RTO (MinRTO) if it is less. The Linux implementation, as a discrepancy from the Standard, basically adds the defined MinRTO to the calculated RTO. When comparing both approaches, the Linux calculation seems to perform worse for sender-limited TCP flows like Telnet, SSH, or constant-bit-rate encoded transmissions, especially for Round Trip Times (RTT) of 50ms to 800ms.

Compared to the Linux implementation, the RFC 6298 proposed RTO calculation performs better and adapts more precisely to current network characteristics. Extensive measurements for bulk data did not show a negative impact of the adjusted calculation.

Exemplary performance comparison for sender-limited flows:
- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Amount of Data Segments: 300
- Mean of 11 runs

Mean Response Waiting Time [milliseconds]

PER [%] |   0.5     1   1.5     2     3     5     7    10
--------+------------------------------------------------
old     | 206.4 208.6 218.0 218.6 227.2 249.3 274.7 308.2
new     | 203.9 206.0 207.0 209.9 217.3 225.6 238.7 259.1

Detailed analysis:
https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/

Reasoning for the historic design:
Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP. Conference Paper. Proceedings of the FREENIX Track, 2002 USENIX Annual Technical Conference.
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf

Signed-off-by: Hagen Paul Pfeifer
Signed-off-by: Daniel Metz
Cc: Eric Dumazet
Cc: Yuchung Cheng
---
v3:
- remove mdev_us

v2:
- Using the RFC 6298 compliant implementation, the tcp_sock struct variable u32 mdev_max_us becomes obsolete and consequently is being removed.
- Add reference to Kuznetsov paper

 include/linux/tcp.h      |  2 --
 net/ipv4/tcp.c           |  3 +-
 net/ipv4/tcp_input.c     | 72 ++--
 net/ipv4/tcp_metrics.c   |  5 ++--
 net/ipv4/tcp_minisocks.c |  1 -
 5 files changed, 18 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7be9b12..3128eb1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -230,8 +230,6 @@ struct tcp_sock {

 /* RTT measurement */
 	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
-	u32	mdev_us;	/* medium deviation */
-	u32	mdev_max_us;	/* maximal mdev for the last rtt period */
 	u32	rttvar_us;	/* smoothed mdev_max */
 	u32	rtt_seq;	/* sequence number to update rttvar */
 	struct rtt_meas {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed14..4a7597c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -386,7 +386,6 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsq_node);

 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
-	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	tp->rtt_min[0].rtt = ~0U;

 	/* So many TCP implementations out there (incorrectly) count the
@@ -2703,7 +2702,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_pmtu = icsk->icsk_pmtu_cookie;
 	info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
 	info->tcpi_rtt = tp->srtt_us >> 3;
-	info->tcpi_rttvar = tp->mdev_us >> 2;
+	info->tcpi_rttvar = tp->rttvar_us >> 2;
 	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
 	info->tcpi_snd_cwnd = tp->snd_cwnd;
 	info->tcpi_advmss = tp->advmss;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff..279f5f7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,19 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 	long m = mrtt_us;	/* RTT */
 	u32 srtt = tp->srtt_us;

-	/* The following amusing code comes from Jacobson's
-	 * article in SIGCOMM '88. Note that rtt and mdev
-	 * are scaled
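The hunk above (cut off in the archive) rewrites tcp_rtt_estimator(). For reference, the RFC 6298 averaging the patch moves toward can be sketched in user space like this, using plain microseconds and integer division rather than the kernel's srtt_us << 3 / rttvar_us << 2 fixed-point form:

```c
#include <assert.h>

/* RFC 6298 estimator state, in plain usec (no fixed-point scaling). */
struct rtt_est {
	long srtt_us;	/* smoothed RTT */
	long rttvar_us;	/* RTT variation */
};

/* One measurement update with alpha = 1/8 and beta = 1/4, per RFC 6298:
 *   RTTVAR = 3/4 * RTTVAR + 1/4 * |SRTT - R'|   (using the old SRTT)
 *   SRTT   = 7/8 * SRTT   + 1/8 * R'
 */
static void rtt_update(struct rtt_est *e, long m_us)
{
	long err;

	if (e->srtt_us == 0) {
		/* First measurement: SRTT = R, RTTVAR = R/2. */
		e->srtt_us = m_us;
		e->rttvar_us = m_us / 2;
		return;
	}
	err = e->srtt_us - m_us;
	if (err < 0)
		err = -err;
	e->rttvar_us = (3 * e->rttvar_us + err) / 4;
	e->srtt_us = (7 * e->srtt_us + m_us) / 8;
}
```

Note that RTTVAR must be updated with the old SRTT before SRTT itself is updated, which is why the two assignments are ordered as above.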