[PATCH v7 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high inter-transmission
time (ITT). By bundling already sent data in packets with new data, RDB
alleviates head-of-line blocking by reducing the need to retransmit
data segments when packets are lost.

RDB is a continuation of the work on latency improvements for TCP in
Linux, previously resulting in two thin-stream mechanisms in the Linux
kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs [1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
made not to negatively affect competing traffic in an unfair manner.

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP" [2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v7 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed sysctl_tcp_rdb to accept three values (Thanks Yuchung):
     - 0: Disable system wide (RDB cannot be enabled with TCP_RDB
          socket option)
     - 1: Allow enabling RDB with TCP_RDB socket option
     - 2: Enable RDB by default on all TCP sockets and allow modifying
          it with TCP_RDB
   * Added sysctl tcp_rdb_wait_congestion to control whether RDB by
     default should wait for congestion before bundling. (Ref. comment
     by Eric on lossless conns)
   * Changed socket options to modify per-socket RDB settings:
     - Added flags to TCP_RDB to allow bundling without waiting for
       loss with TCP_RDB_BUNDLE_IMMEDIATE.
     - Added socket option TCP_RDB_MAX_BYTES: Set max bytes per RDB
       packet.
     - Added socket option TCP_RDB_MAX_PACKETS: Set max packets
       allowed to be bundled by RDB.
   * Added SNMP counter LINUX_MIB_TCPRDBLOSSREPAIRS to count the
     occurrences where RDB repaired a loss (Thanks Eric).
   * Bundle on FIN packets (Thanks Yuchung).
   * Updated docs in Documentation/networking/{ip-sysctl.txt,tcp-thin.txt}
   * Removed flags parameter from tcp_rdb_ack_event()
   * Changed sysctl knobs to use network namespace.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Changed sysctl knobs to use network namespace.

v6 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
   * Minor doc changes
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Minor doc changes

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in
     rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss() and
     restructured it to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function
     from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into
     tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about not
     exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Changed calculation in tcp_stream_is_thin_dpifl() based on
     feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt() (TCP_RDB) to
     reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss()

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt | 43 ++
 Documentation/netw
[PATCH v7 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics, characterized by small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.

RDB enables a TCP sender to bundle redundant (already sent) data with
TCP packets containing small segments of new data. By resending
un-ACKed data from the output queue in packets with new data, RDB
reduces the need to retransmit data segments on connections
experiencing sporadic packet loss. By avoiding a retransmit, RDB
evades the latency increase of at least one RTT for the lost packet,
as well as alleviating head-of-line blocking for the packets following
the lost packet. This makes the TCP connection more resistant to
latency fluctuations, and reduces the application layer latency
significantly in lossy environments.

Main functionality added:

  o When a packet is scheduled for transmission, RDB builds and
    transmits a new SKB containing both the unsent data as well as
    data of previously sent packets from the TCP output queue.

  o RDB will only be used for streams classified as thin by the
    function tcp_stream_is_thin_dpifl(). This enforces a lower bound
    on the ITT for streams that may benefit from RDB, controlled by
    the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine
    due to lack of dupACKs. This is because the loss is "repaired" by
    the redundant data in the packet coming after the lost packet.
    Based on incoming ACKs, such hidden loss events are detected, and
    CWR state is entered.

RDB can be enabled on a connection with the socket option TCP_RDB, or
on all new connections by setting the sysctl variable
net.ipv4.tcp_rdb=2

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Pål Halvorsen <pa...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kristian Evensen <kristian.even...@gmail.com>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  35 +
 Documentation/networking/tcp-thin.txt  | 188 --
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |  11 +-
 include/net/netns/ipv4.h               |   5 +
 include/net/tcp.h                      |  12 ++
 include/uapi/linux/snmp.h              |   1 +
 include/uapi/linux/tcp.h               |  10 ++
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/proc.c                        |   1 +
 net/ipv4/sysctl_net_ipv4.c             |  34 +
 net/ipv4/tcp.c                         |  42 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_ipv4.c                    |   5 +
 net/ipv4/tcp_output.c                  |  49 ---
 net/ipv4/tcp_rdb.c                     | 240 +
 17 files changed, 579 insertions(+), 63 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d856b98..d26d12b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -726,6 +726,41 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - INTEGER
+	Controls the use of the Redundant Data Bundling (RDB) mechanism
+	for TCP connections.
+
+	RDB is a TCP mechanism aimed at reducing the latency for
+	applications transmitting time-dependent data. By bundling
+	already sent data in packets with new data, RDB alleviates
+	head-of-line blocking on the receiver side by reducing the need
+	to retransmit data segments when packets are lost. See
+	tcp-thin.txt for further details.
+
+	Possible values:
+	0 - Disable RDB system wide, i.e. disallow enabling RDB on TCP
+	    sockets with the socket option TCP_RDB.
+	1 - Allow enabling/disabling RDB with socket option TCP_RDB.
+	2 - Set RDB to be enabled by default for all new TCP connections
+	    and allow modifying the socket with socket option TCP_RDB.
+
+	Default: 1
+
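As a rough illustration of the per-socket knob described above, the
following sketch attempts to enable RDB with setsockopt(). Note that
TCP_RDB is not a mainline socket option; the option number used here
(9999) is a deliberately fake placeholder, so on an unpatched kernel
the call fails and the helper reports that RDB is unsupported.

```python
import socket

# Hypothetical placeholder; the real TCP_RDB value would come from the
# patched kernel's <linux/tcp.h>. 9999 is intentionally invalid.
TCP_RDB = 9999

def try_enable_rdb(sock):
    """Attempt to enable RDB on a TCP socket.

    Returns True if the kernel accepted the option, False otherwise
    (e.g. ENOPROTOOPT on kernels without the RDB patch set).
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_RDB, 1)
        return True
    except OSError:
        return False

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
supported = try_enable_rdb(s)
s.close()
print("RDB supported:", supported)
```

The system-wide behavior would instead be selected with the
net.ipv4.tcp_rdb sysctl (0/1/2) described in the documentation above.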
[PATCH v7 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams,
tcp_stream_is_thin(), is based on a static limit of less than 4
packets in flight. This treats streams differently depending on the
connection's RTT, such that a stream on a high RTT link may never be
considered thin, whereas the same application would produce a stream
that would always be thin in a low RTT scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin
stream detection will be independent of the RTT and treat streams
equally based on the transmission pattern, i.e. the inter-transmission
time (ITT).

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Pål Halvorsen <pa...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kristian Evensen <kristian.even...@gmail.com>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/netns/ipv4.h               |  1 +
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9ae9293..d856b98 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -718,6 +718,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d061ffe..71be4ac 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -111,6 +111,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_orphan_retries;
 	int sysctl_tcp_fin_timeout;
 	unsigned int sysctl_tcp_notsent_lowat;
+	int sysctl_tcp_thin_dpifl_itt_lower_bound;
 
 	int sysctl_igmp_max_memberships;
 	int sysctl_igmp_max_msf;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a79894b..9956af9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -214,6 +214,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6	/* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -1652,6 +1654,25 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Test if the stream is thin based on
+ *                              dynamic PIF limit (DPIFL)
+ * @sk: socket
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct sock *sk)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tcp_sk(sk)) *
+	       sock_net(sk)->ipv4.sysctl_tcp_thin_dpifl_itt_lower_bound <
+	       (tcp_sk(sk)->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cb67de..150969d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -960,6 +961,14 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound"
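The DPIFL test in tcp_stream_is_thin_dpifl() above can be sketched in
plain Python to show the arithmetic. The kernel avoids a division by
rearranging PIF < RTT / ITT-lower-bound into PIF * ITT-lower-bound <
RTT (srtt_us is stored scaled by 8 in the kernel, hence the >> 3 there;
this sketch takes an unscaled srtt in microseconds).

```python
def is_thin_dpifl(packets_in_flight, srtt_us, itt_lower_bound_us=10000):
    """A stream is thin while its packets-in-flight count is below the
    dynamic limit srtt / itt_lower_bound (division avoided by
    cross-multiplying, as in the kernel code)."""
    return packets_in_flight * itt_lower_bound_us < srtt_us

# A stream writing every ~100 ms on a 200 ms RTT link has ~2 PIF:
print(is_thin_dpifl(2, 200000))   # 2 * 10000 < 200000 -> thin
# A bulk sender with 50 PIF on the same link is not thin:
print(is_thin_dpifl(50, 200000))  # 50 * 10000 >= 200000 -> not thin
```

This shows why the test is RTT-independent in the sense the commit
message describes: the same application write pattern yields the same
classification whether the RTT is 10 ms or 200 ms, because PIF scales
with RTT for a fixed ITT.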
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 21/03/16 19:54, Yuchung Cheng wrote:
> On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad
> <bro.de...@gmail.com> wrote:
>>
>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>> index 6a92b15..8f3f3bf 100644
>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>>>>         calculated, which is used to classify whether a stream is thin.
>>>>         Default: 10000
>>>>
>>>> +tcp_rdb - BOOLEAN
>>>> +       Enable RDB for all new TCP connections.
>>> Please describe RDB briefly, perhaps with a pointer to your paper.
>>
>> Ah, yes, that description may have been a bit too brief...
>>
>> What about pointing to tcp-thin.txt in the brief description, and
>> rewriting tcp-thin.txt with a more detailed description of RDB along
>> with a paper reference?
> +1
>>
>>> I suggest have three level of controls:
>>> 0: disable RDB completely
>>> 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket options
>>> 2: enable RDB on all thin-stream conn. by default
>>>
>>> currently it only provides mode 1 and 2. but there may be cases where
>>> the administrator wants to disallow it (e.g., broken middle-boxes).
>>
>> Good idea. Will change this.

I have implemented your suggestion in the next patch.

>>> It also seems better to allow individual socket to select the
>>> redundancy level (e.g., setsockopt TCP_RDB=3 means <=3 pkts per
>>> bundle) vs a global setting. This requires more bits in tcp_sock but
>>> 2-3 more is suffice.
>>
>> Most certainly. We decided not to implement this for the patch to keep
>> it as simple as possible, however, we surely prefer to have this
>> functionality included if possible.

Next patch version has a socket option to allow modifying the
different RDB settings.

>>>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
>>>> +                         unsigned int mss_now, gfp_t gfp_mask)
>>>> +{
>>>> +       struct sk_buff *rdb_skb = NULL;
>>>> +       struct sk_buff *first_to_bundle;
>>>> +       u32 bytes_in_rdb_skb = 0;
>>>> +
>>>> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
>>>> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
>>>
>>>> +
>>>> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
>>> During loss recovery tcp inflight fluctuates and would like to trigger
>>> this check even for non-thin-stream connections.
>>
>> Good point.
>>
>>> Since the loss already occurs, RDB can only take advantage from
>>> limited-transmit, which it likely does not have (b/c its a
>>> thin-stream). It might be checking if the state is open.
>>
>> You mean to test for open state to avoid calling rdb_can_bundle_test()
>> unnecessarily if we (presume to) know it cannot bundle anyway? That
>> makes sense, however, I would like to do some tests on whether "state
>> != open" is a good indicator on when bundling is not possible.

When testing this I found that bundling can often be performed when
not in Open state. For the most part in CWR mode, but also in the
other modes, so this does not seem like a good indicator.

The only problem with tcp_stream_is_thin_dpifl() triggering for
non-thin streams in loss recovery would be the performance penalty of
calling rdb_can_bundle_test(). It would not be able to bundle anyway,
since the previous SKB would contain >= mss worth of data.

The most reliable test is to check the available space in the previous
SKB, i.e. if (xmit_skb->prev->len == mss_now). Do you suggest, for
performance reasons, to do this before the call to
tcp_stream_is_thin_dpifl()?

>>> since RDB will cause DSACKs, and we only blindly count DSACKs to
>>> perform CWND undo. How does RDB handle that false positives?
>>
>> That is a very good question. The simple answer is that the
>> implementation does not handle any such false positives, which I
>> expect can result in incorrectly undoing CWND reduction in some
>> cases. This gets a bit complicated, so I'll have to do some more
>> testing on this to verify with certainty when it happens.
>>
>> When there is no loss, and each RDB packet arriving at the receiver
>> contains both
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index 6a92b15..8f3f3bf 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>>         calculated, which is used to classify whether a stream is thin.
>>         Default: 10000
>>
>> +tcp_rdb - BOOLEAN
>> +       Enable RDB for all new TCP connections.
> Please describe RDB briefly, perhaps with a pointer to your paper.

Ah, yes, that description may have been a bit too brief...

What about pointing to tcp-thin.txt in the brief description, and
rewriting tcp-thin.txt with a more detailed description of RDB along
with a paper reference?

> I suggest have three level of controls:
> 0: disable RDB completely
> 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket options
> 2: enable RDB on all thin-stream conn. by default
>
> currently it only provides mode 1 and 2. but there may be cases where
> the administrator wants to disallow it (e.g., broken middle-boxes).

Good idea. Will change this.

>> +       Default: 0
>> +
>> +tcp_rdb_max_bytes - INTEGER
>> +       Enable restriction on how many bytes an RDB packet can contain.
>> +       This is the total amount of payload including the new unsent data.
>> +       Default: 0
>> +
>> +tcp_rdb_max_packets - INTEGER
>> +       Enable restriction on how many previous packets in the output queue
>> +       RDB may include data from. A value of 1 will restrict bundling to
>> +       only the data from the last packet that was sent.
>> +       Default: 1
> why two metrics on redundancy?

We have primarily used the packet based limit in our tests. This is
also the most important knob, as it directly controls how many lost
packets each RDB packet may recover.

We believe that the byte based limit can also be useful because it
allows more fine-grained control over how much impact RDB can have on
the increased bandwidth requirements of the flows. If an application
writes 700 bytes per write call, the bandwidth increase can be quite
significant (even with a 1 packet bundling limit) if we consider a
scenario with thousands of RDB streams.

In some of our experiments with many simultaneous thin streams, where
we set up a bottleneck rate limited by an htb with a pfifo queue, we
observed considerable differences in loss rates depending on how many
bytes (packets) were allowed to be bundled with each packet. This is
partly why we recommend a default bundling limit of 1 packet.

By limiting the total payload size of RDB packets to e.g. 100 bytes,
only the smallest segments will benefit from RDB, while the segments
that would increase the bandwidth requirements the most will not.
While a very large number of RDB streams from one sender may be a
corner case, we still think this sysctl knob can be valuable for a
sysadmin who finds himself in such a situation.

> It also seems better to allow individual socket to select the
> redundancy level (e.g., setsockopt TCP_RDB=3 means <=3 pkts per
> bundle) vs a global setting. This requires more bits in tcp_sock but
> 2-3 more is suffice.

Most certainly. We decided not to implement this for the patch to keep
it as simple as possible, however, we surely prefer to have this
functionality included if possible.

>> +static unsigned int rdb_detect_loss(struct sock *sk)
>> +{
...
>> +       tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
>> +               if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
>> +                       break;
>> +               packets_lost++;
> since we only care if there is packet loss or not, we can return early here?

Yes, I considered that, and as long as the number of packets presumed
to be lost is not needed, that will suffice. However, could this not
be useful for statistical purposes? This is also relevant to the
comment from Eric on SNMP counters for how many times losses could be
repaired by RDB.

>> +               }
>> +               break;
>> +       }
>> +       return packets_lost;
>> +}
>> +
>> +/**
>> + * tcp_rdb_ack_event() - initiate RDB loss detection
>> + * @sk: socket
>> + * @flags: flags
>> + */
>> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
> flags are not used

Ah, yes, will remove that.

>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
>> +                        unsigned int mss_now, gfp_t gfp_mask)
>> +{
>> +       struct sk_buff *rdb_skb = NULL;
>> +       struct sk_buff *first_to_bundle;
>> +       u32 bytes_in_rdb_skb = 0;
>> +
>> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
>> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
>
>> +
>> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
> During loss recovery tcp inflight fluctuates and would like to trigger
> this check even for non-thin-stream
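The interplay of the two limits discussed above (tcp_rdb_max_packets
and tcp_rdb_max_bytes) can be sketched outside the kernel. This is an
illustrative model only, not the kernel's rdb_can_bundle_test(), which
operates on SKBs; the function name and parameters here are invented
for the example. It counts how many previous un-ACKed segments may be
bundled ahead of a new segment, honoring the one-MSS ceiling, the
packet limit, and the optional byte limit (0 = unlimited, as in the
quoted sysctl docs).

```python
def count_bundled(prev_seg_sizes, new_seg_size, mss,
                  max_packets=1, max_bytes=0):
    """prev_seg_sizes: payload sizes of previously sent, un-ACKed
    segments, most recent first. Returns how many of them can be
    bundled with the new segment."""
    total = new_seg_size
    bundled = 0
    for size in prev_seg_sizes:
        if max_packets and bundled >= max_packets:
            break                      # packet-count limit reached
        if total + size > mss:
            break                      # never exceed one MSS of payload
        if max_bytes and total + size > max_bytes:
            break                      # total-payload (byte) limit
        total += size
        bundled += 1
    return bundled

# 100-byte write with two prior 100-byte segments, default limit of 1:
print(count_bundled([100, 100], 100, mss=1460))                 # -> 1
# Raising the packet limit allows both previous segments to be bundled:
print(count_bundled([100, 100], 100, mss=1460, max_packets=2))  # -> 2
# A full-sized previous segment can never be bundled:
print(count_bundled([1400], 100, mss=1460))                     # -> 0
```

The last case mirrors the point made above: once the previous SKB
holds >= mss worth of data, no bundling is possible regardless of the
configured limits.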
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 14/03/16 22:15, Eric Dumazet wrote:
> Acked-by: Eric Dumazet
>
> Note that RDB probably should get some SNMP counters,
> so that we get an idea of how many times a loss could be repaired.

Good idea. Simply count how many times an RDB packet successfully
repaired a loss? Note that this can be one or more lost packets. When
bundling N packets, the RDB packet can repair up to N losses in the
previous N packets that were sent.

Which list should this be added to? snmp4_tcp_list?

Any other counters that would be useful? Total number of RDB packets
transmitted?

> Ideally, if the path happens to be lossless, all these pro active
> bundles are overhead. Might be useful to make RDB conditional to
> tp->total_retrans or something.

Yes, that is a good point. We have discussed this (for years really),
but have not had the opportunity to investigate it in depth.

Having such a condition hard coded is not ideal, as it very much
depends on the use case whether bundling from the beginning is
desirable. In most cases, this is probably a fair compromise, but
preferably we would have some logic/settings to control how the
bundling rate can be dynamically adjusted in response to certain
events, defined by a set of given metrics.

A conservative (default) setting would not do bundling until loss has
been registered, and could also check against some smoothed loss
indicator, such that a certain amount of loss must have occurred
within a specific time frame to allow bundling. This could be useful
in cases where the network congestion varies greatly depending on
factors such as the time of day.

In a scenario where minimal application layer latency is very
important, but only sporadic (single) packet loss is expected to
regularly occur, always bundling one previous packet may be both
sufficient and desirable.

In the end, the best settings for an application/service depend on the
degree to which application layer latency (both minimal and
variations) affects the QoE. There are many possibilities to consider
in this regard, and I expect we will not have this question fully
explored any time soon. Most importantly, we should ensure that such
logic can easily be added later on without breaking backwards
compatibility.

Suggestions and comments on this are very welcome.

Bendik
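The conservative "wait for loss before bundling" policy discussed in
this exchange (Eric's tp->total_retrans suggestion, which later became
the tcp_rdb_wait_congestion sysctl in v7) can be sketched as a simple
predicate. All names and parameters here are illustrative, not the
kernel's; this only models the decision logic.

```python
def rdb_should_bundle(total_retrans, wait_congestion=True,
                      bundle_immediate=False):
    """Decide whether RDB bundling should be attempted.

    total_retrans: retransmissions seen so far on the connection.
    wait_congestion: conservative default; no loss yet means bundling
                     would be pure overhead on a lossless path.
    bundle_immediate: per-socket override in the spirit of the
                      TCP_RDB_BUNDLE_IMMEDIATE flag from v7.
    """
    if bundle_immediate:
        return True
    if wait_congestion:
        return total_retrans > 0
    return True

print(rdb_should_bundle(total_retrans=0))  # lossless so far -> False
print(rdb_should_bundle(total_retrans=3))  # loss observed   -> True
```

A smoothed loss indicator, as proposed above, would replace the plain
total_retrans > 0 test with a rate measured over a time window.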
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 14/03/16 22:59, Yuchung Cheng wrote:
> OK that makes sense.
>
> I left some detailed comments on the actual patches. I would encourage
> to submit an IETF draft to gather feedback from tcpm b/c the feature
> seems portable.

Thank you for the suggestion, we appreciate the confidence. We have
had in mind to eventually pursue a standardization process, but have
been unsure about how a mechanism that actively introduces redundancy
would be received by the IETF. It may now be the right time to propose
the RDB mechanism, and we will aim to present an IETF draft in the
near future.

Bendik
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 03/10/2016 01:20 AM, Yuchung Cheng wrote:
> I read the paper. I think the underlying idea is neat. but the
> implementation is little heavy-weight that requires changes on fast
> path (tcp_write_xmit) and space in skb control blocks.

Yuchung, thank you for taking the time to review the patch submission
and read the paper.

I must admit I was not particularly happy about the extra if-test on
the fast path, and I fully understand the wish to keep the fast path
as simple and clean as possible. However, is the performance hit that
significant considering the branch prediction hint for the non-RDB
path?

The extra variable needed in the SKB CB does not require increasing
the CB buffer size due to the "tcp: refactor struct tcp_skb_cb" patch
(http://patchwork.ozlabs.org/patch/510674), and uses only some of the
space made available in the outgoing SKBs' CB. Therefore I hoped the
extra variable would be acceptable.

> ultimately this patch is meant for a small set of specific
> applications.

Yes, the RDB mechanism is aimed at a limited set of applications,
specifically time-dependent applications that produce non-greedy,
application limited (thin) flows. However, our hope is that RDB may
greatly improve TCP's position as a viable alternative for
applications transmitting latency sensitive data.

> In my mental model (please correct me if I am wrong), losses on these
> thin streams would mostly resort to RTOs instead of fast recovery, due
> to the bursty nature of Internet losses.

This depends on the transmission pattern of the applications, which
varies a great deal, also between the different types of
time-dependent applications that produce thin streams.

For short flows, (bursty) loss at the end will result in an RTO (if
TLP does not probe), but thin streams are often long lived, and the
applications producing them continue to write small data segments to
the socket at intervals of tens to hundreds of milliseconds.

What determines whether an RTO, and not a fast retransmit, will resend
the packet is the number of PIFs, which directly correlates to how
often the application writes data to the socket in relation to the
RTT. As long as the number of packets successfully completing a round
trip before the RTO is >= the dupACK threshold, they will not depend
on RTOs (not considering TLP). Early retransmit and the
TCP_THIN_DUPACK socket option will also affect the likelihood of RTOs
vs fast retransmits.

> The HOLB comes from RTO only retransmit the first (tiny) unacked
> packet while a small of new data is readily available. But since Linux
> congestion control is packet-based, and loss cwnd is 1, the new data
> needs to wait until the 1st packet is acked which is for another RTT.

If I understand you correctly, you are referring to HOLB on the sender
side, which is the extra delay on new data that is held back when the
connection is CWND-limited. In the paper, we refer to this extra delay
as increased sojourn times for the outgoing data segments.

We do not include this additional sojourn time for the segments on the
sender side in the ACK latency plots (Fig. 4 in the paper). This is
simply because the pcap traces contain the timestamps when the packets
are sent, and not when the segments are added to the output queue.

When we refer to the HOLB effect in the paper as well as the thesis,
we refer to the extra delays (sojourn times) on the receiver side,
where segments are held back (not made available to user space) due
to gaps in the sequence range when packets are lost (we had no
reordering).

So, when considering the increased delays due to HOLB on the receiver
side, HOLB is not at all limited to RTOs. Actually, it is mostly not
due to RTOs in the tests we have run; however, this also depends very
much on the transmission pattern of the application as well as on loss
levels. In general, HOLB on the receiver side will affect any flow
that transmits a packet with new data after a packet is lost (the
sender may not know yet), where the lost packet has not already been
retransmitted.

Consider a sender application that performs write calls every 30 ms on
a 150 ms RTT link. It will need a CWND that allows 5-6 PIFs to be able
to transmit all new data segments with no extra sojourn times on the
sender side. When one packet is lost, the next 5 packets that are sent
will be held back on the receiver side due to the missing segment
(HOLB). In the best case scenario, the first dupACK triggers a fast
retransmit around the same time as the fifth packet (after the lost
packet) is sent. In that case, the first segment sent after the lost
segment is held back on the receiver for 150 ms (the time it takes for
the dupACK to reach the sender, and the fast retransmission to arrive
at the receiver). The second is held back 120 ms, the third 90 ms, the
fourth 60 ms, and the fifth 30 ms. All of this extra delay is added
before the sender even knows there was a loss. How it decides to react
to the loss signal (dupACKs) will further decide how much
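The hold-back arithmetic in the 30 ms / 150 ms example above can be
written out directly. This is just the best-case scenario described in
the text (fast retransmit arrives exactly one RTT after the loss, with
the application writing at a fixed interval); segment i after the loss
waits until the retransmission lands.

```python
def holb_delays_ms(write_interval_ms, rtt_ms):
    """Receiver-side hold-back delay, in ms, for each segment sent
    after a single lost packet, assuming the fast retransmission
    arrives one RTT after the loss. Segment i (1-based) has already
    waited (i - 1) write intervals less than the first one."""
    n = rtt_ms // write_interval_ms  # segments sent before the rtx lands
    return [rtt_ms - (i - 1) * write_interval_ms for i in range(1, n + 1)]

print(holb_delays_ms(30, 150))  # -> [150, 120, 90, 60, 30]
```

This reproduces the figures from the prose: five segments held back
for 150, 120, 90, 60, and 30 ms respectively, all incurred before the
sender has even learned of the loss.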
[PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation on the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner. Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674) These patches have also been tested with as set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2]. 
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v6 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
   * Minor doc changes
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Minor doc changes

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss() and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  34 +
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  48 ---
 net/ipv4/tcp_rdb.c                     | 228 +
 12 files changed, 375 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

--
1.9.1
[PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games, remote control systems, and VoIP, produce traffic with thin-stream characteristics, characterized by small packets and relatively high inter-transmission times (ITT). When experiencing packet loss, such latency-sensitive applications are heavily penalized by the need to retransmit lost packets, which increases the latency by a minimum of one RTT for the lost packet. Packets coming after a lost packet are held back due to head-of-line blocking, causing increased delays for all data segments until the lost packet has been retransmitted. RDB enables a TCP sender to bundle redundant (already sent) data with TCP packets containing small segments of new data. By resending un-ACKed data from the output queue in packets with new data, RDB reduces the need to retransmit data segments on connections experiencing sporadic packet loss. By avoiding a retransmit, RDB evades the latency increase of at least one RTT for the lost packet, as well as alleviating head-of-line blocking for the packets following the lost packet. This makes the TCP connection more resistant to latency fluctuations, and reduces the application layer latency significantly in lossy environments. Main functionality added: o When a packet is scheduled for transmission, RDB builds and transmits a new SKB containing both the unsent data as well as data of previously sent packets from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound. o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. 
This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. RDB can be enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable net.ipv4.tcp_rdb=1 Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 15 +++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 2 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 25 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 48 --- net/ipv4/tcp_rdb.c | 228 + 12 files changed, 335 insertions(+), 23 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 6a92b15..8f3f3bf 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. 
+ Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 797cefb..0f2c9d1 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned in
[PATCH v6 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams, tcp_stream_is_thin(), is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index d5df40c..6a92b15 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 1 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 692db63..d38eae9 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (1 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 1 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Test if the stream is thin based on + * dynamic PIF limit (DPIFL) + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT). 
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 1e1fe60..f04320a 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -572,6 +573,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = _tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_d
Re: [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 03/02/2016 08:52 PM, David Miller wrote: > From: "Bendik Rønning Opstad" <bro.de...@gmail.com> > Date: Wed, 24 Feb 2016 22:12:55 +0100 > >> +/* tcp_rdb.c */ >> +void rdb_ack_event(struct sock *sk, u32 flags); > > Please name globally visible symbols with a prefix of "tcp_*", thanks. Yes, of course. I will fix that for the next version. Thanks. Bendik
[PATCH v5 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost.

RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner.

Note: The current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss() and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  34 +
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  47 ---
 net/ipv4/tcp_rdb.c                     | 225 +
 12 files changed, 371 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

--
1.9.1
[PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games, remote control systems, and VoIP, produce traffic with thin-stream characteristics, characterized by small packets and relatively high inter-transmission times (ITT). When experiencing packet loss, such latency-sensitive applications are heavily penalized by the need to retransmit lost packets, which increases the latency by a minimum of one RTT for the lost packet. Packets coming after a lost packet are held back due to head-of-line blocking, causing increased delays for all data segments until the lost packet has been retransmitted. RDB enables a TCP sender to bundle redundant (already sent) data with TCP packets containing small segments of new data. By resending un-ACKed data from the output queue in packets with new data, RDB reduces the need to retransmit data segments on connections experiencing sporadic packet loss. By avoiding a retransmit, RDB evades the latency increase of at least one RTT for the lost packet, as well as alleviating head-of-line blocking for the packets following the lost packet. This makes the TCP connection more resistant to latency fluctuations, and reduces the application layer latency significantly in lossy environments. Main functionality added: o When a packet is scheduled for transmission, RDB builds and transmits a new SKB containing both the unsent data as well as data of previously sent packets from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound. o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. 
This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. RDB can be enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable net.ipv4.tcp_rdb=1 Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 15 +++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 2 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 25 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 47 --- net/ipv4/tcp_rdb.c | 225 + 12 files changed, 331 insertions(+), 23 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 3b23ed8f..e3869e7 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. 
+ Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index eab4f8f..af15a3c 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2931,6 +2931,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned in
[PATCH v5 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams, tcp_stream_is_thin(), is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 24ce97f..3b23ed8f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 1 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 692db63..a29300a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (1 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 1 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT). 
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 1e1fe60..f04320a 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -572,6 +573,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = _tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_d
Re: [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 02/18/2016 04:18 PM, Eric Dumazet wrote: >> >> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) >> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) >> { >> __copy_skb_header(new, old); >> >> @@ -1061,6 +1061,7 @@ static void copy_skb_header(struct sk_buff *new, const >> struct sk_buff *old) >> skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs; >> skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type; >> } >> +EXPORT_SYMBOL(copy_skb_header); > > Why are you exporting this ? tcp is statically linked into vmlinux. Ah, this is actually leftover from the earlier module based implementation of RDB. Will remove. >> +EXPORT_SYMBOL(skb_append_data); > > Same remark here. Will remove. > And this is really a tcp helper, you should add a tcp_ prefix. Certainly. > About rdb_build_skb() : I do not see where you make sure > @bytes_in_rdb_skb is not too big ? The number of previous SKBs in the queue to copy data from is given by rdb_can_bundle_test(), which tests if total payload does not exceed the MSS. Only if there is room (within the MSS) will it test the sysctl options to further restrict bundling: + /* We start at xmit_skb->prev, and go backwards. */ + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { + if ((total_payload + skb->len) > mss_now) + break; + + if (sysctl_tcp_rdb_max_bytes && + ((total_payload + skb->len) > sysctl_tcp_rdb_max_bytes)) + break; I'll combine these two to (total_payload + skb->len) > max_payload > tcp_rdb_max_bytes & tcp_rdb_max_packets seem to have no .extra2 upper > limit, so a user could do something really stupid and attempt to crash > the kernel. Those sysctl additions are actually a bit buggy, specifically the proc_handlers. Is it not sufficient to ensure that 0 is the lowest possible value? The max payload limit is really min(mss_now, sysctl_tcp_rdb_max_bytes), so if sysctl_tcp_rdb_max_bytes or sysctl_tcp_rdb_max_packets are set too large, bundling will simply be limited by the MSS. 
> Presumably I would use SKB_MAX_HEAD(MAX_TCP_HEADER) so that we do not > try high order page allocation. Do you suggest something like this?: bytes_in_rdb_skb = min_t(u32, bytes_in_rdb_skb, SKB_MAX_HEAD(MAX_TCP_HEADER)); Is this necessary when bytes_in_rdb_skb will always contain exactly the required number of bytes for the payload of the (RDB) packet, which will never be greater than mss_now? Or is it aimed at scenarios where the page size is so small that allocating to an MSS (of e.g. 1460) will require high order page allocation? Thanks for looking over the code! Bendik
[PATCH v4 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost.

RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner.

Note: The current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 ++
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  49 +---
 net/ipv4/tcp_rdb.c                     | 215 +
 12 files changed, 365 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

--
1.9.1
[PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. 
RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl net.ipv4.tcp_rdb=1 Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 15 +++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 ++- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 49 +--- net/ipv4/tcp_rdb.c | 215 + 12 files changed, 325 insertions(+), 23 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 3b23ed8f..e3869e7 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 3920675..9075041 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2923,6 +2923,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index bcbf51d..c84de15 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -207,9 +207,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair :
[PATCH v4 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 24ce97f..3b23ed8f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index bdd5e1c..a23ce24 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT).
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index b537338..e7b1b30 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -595,6 +596,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = &sysctl_tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = _d
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Sorry guys, I messed up that email by including HTML, and it got rejected by netdev@vger.kernel.org. I'll resend it properly formatted. Bendik
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Eric, thank you for the feedback! On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote: >> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: >>> Really this looks very complicated. >> >> Can you be more specific? > > A lot of code added, needing maintenance cost for years to come. Yes, that is understandable. >>> Why not simply append the new skb content to prior one ? >> >> It's not clear to me what you mean. At what stage in the output engine >> do you refer to? >> >> We want to avoid modifying the data of the SKBs in the output queue, > > Why ? We already do that, as I pointed out. I suspect that we might be talking past each other. It wasn't clear to me that we were discussing how to implement this in a different way. The current retrans collapse functionality only merges SKBs that contain data that has already been sent and is about to be retransmitted. This differs significantly from RDB, which combines both already transmitted data and unsent data in the same packet without changing how the data is stored (and the state tracked) in the output queue. Another difference is that RDB includes un-ACKed data that is not considered lost. >> therefore we allocate a new SKB (This SKB is named rdb_skb in the code). >> The header and payload of the first SKB containing data we want to >> redundantly transmit is then copied. Then the payload of the SKBs following >> next in the output queue is appended onto the rdb_skb. The last payload >> that is appended is from the first SKB with unsent data, i.e. the >> sk_send_head. >> >> Would you suggest a different approach? >> >>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is >>> really available (ie its clone not sitting/waiting in a qdisc on the >>> host) >> >> Where do you suggest this should be used? > > To detect if appending data to prior skb is possible. I see. 
As the implementation intentionally avoids modifying SKBs in the output queue, this was not obvious. > If the prior packet is still in qdisc, no change is allowed, > and it is fine : DRB should not trigger anyway. Actually, whether the data in the prior SKB is on the wire or is still on the host (in qdisc/driver queue) is not relevant. RDB always wants to redundantly resend the data if there is room in the packet, because the previous packet may become lost. >>> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - >>> MAX_TCP_HEADER) available bytes in skb->data. >> >> Sure, rdb_build_skb() could use this instead of the calculated >> bytes_in_rdb_skb. > > Point is : small packets already have tail room in skb->head Yes, I'm aware of that. But we do not allocate new SKBs because we think the existing SKBs do not have enough space available. We do it to avoid modifications to the SKBs in the output queue. > When RDB decides a packet should be merged into the prior one, you can > simply copy payload into the tailroom, then free the skb. > > No skb allocations are needed, only freeing. It wasn't clear to me that you were suggesting a completely different implementation approach altogether. As I understand you, the approach you suggest is as follows: 1. An SKB containing unsent data is processed for transmission (let's call it T_SKB) 2. Check if the previous SKB (let's call it P_SKB) (containing sent but un-ACKed data) has available (tail) room for the payload contained in T_SKB. 3. If room in P_SKB: * Copy the unsent data from T_SKB to P_SKB by appending it to the linear data and update sequence numbers. * Remove T_SKB (which contains only the new and unsent data) from the output queue. * Transmit P_SKB, which now contains some already sent data and some unsent data. If I have misunderstood, can you please elaborate in detail what you mean?
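To check my understanding, the suggested approach could be modeled in userspace roughly like this (a toy sketch with invented names; BUF_CAP stands in for the linear tailroom of an skb, and the struct fields are simplifications of skb state):

```c
#include <string.h>

#define BUF_CAP 2048  /* stand-in for the linear tailroom of an skb */

/* Toy segment with a fixed linear buffer, modeling skb->data + tailroom. */
struct tseg {
    char data[BUF_CAP];
    int len;        /* bytes currently in the buffer */
    int end_seq;    /* sequence number after the last byte */
};

/*
 * Model of the suggested alternative: if the prior (sent, un-ACKed)
 * segment has tailroom for the new payload, append the payload there
 * and tell the caller the new segment can be unlinked and freed.
 * Returns 1 on merge, 0 if there was no room.
 */
static int merge_into_prior(struct tseg *prior, const struct tseg *next)
{
    if (prior->len + next->len > BUF_CAP)
        return 0;                      /* no tailroom: transmit separately */
    memcpy(prior->data + prior->len, next->data, next->len);
    prior->len += next->len;
    prior->end_seq = next->end_seq;    /* sequence space is now contiguous */
    return 1;
}
```

On a successful merge, P_SKB's buffer covers both the already-sent and the new sequence range in one linear run, and T_SKB can simply be removed from the queue.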
If this is the approach you suggest, I can think of some potential downsides that require further considerations: 1) ACK-accounting will work differently When the previous SKB (P_SKB) is modified by appending the data of the next SKB (T_SKB), what should happen when an incoming ACK acknowledges the data that was sent in the original transmission (before the SKB was modified), but not the data that was appended later? tcp_clean_rtx_queue currently handles partially ACKed SKBs due to TSO, in which case the tcp_skb_pcount(skb) > 1. So this function would need to be modified to handle this for RDB modified SKBs in the queue, where all the data is located in the linear data buffer (no GSO segs). How should SACK and retrans flags be handled when one SKB in the output queue can represent multiple transmitted packets? 2) Ti
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote: >> >> o When packets are scheduled for transmission, RDB replaces the SKB to >> be sent with a modified SKB containing the redundant data of >> previously sent data segments from the TCP output queue. > > Really this looks very complicated. Can you be more specific? > Why not simply append the new skb content to prior one ? It's not clear to me what you mean. At what stage in the output engine do you refer to? We want to avoid modifying the data of the SKBs in the output queue, therefore we allocate a new SKB (This SKB is named rdb_skb in the code). The header and payload of the first SKB containing data we want to redundantly transmit is then copied. Then the payload of the SKBs following next in the output queue is appended onto the rdb_skb. The last payload that is appended is from the first SKB with unsent data, i.e. the sk_send_head. Would you suggest a different approach? > skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is > really available (ie its clone not sitting/waiting in a qdisc on the > host) Where do you suggest this should be used? > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - > MAX_TCP_HEADER) available bytes in skb->data. Sure, rdb_build_skb() could use this instead of the calculated bytes_in_rdb_skb. > Also note that tcp_collapse_retrans() is very similar to your needs. You > might simply expand it. The functionality shared is the copying of data from one SKB to another, as well as adjusting sequence numbers and checksum. Unlinking SKBs from the output queue, modifying the data of SKBs in the output queue, and changing retrans hints is not shared. To reduce code duplication, the function skb_append_data in tcp_rdb.c could be moved to tcp_output.c, and then be called from tcp_collapse_retrans. Is it something like this you had in mind? Bendik
[PATCH v3 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, RDB can be kept from affecting competing traffic unfairly. Note: The current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674) These patches have also been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf Changes: v3 (PATCH): * tcp-Add-Redundant-Data-Bundling-RDB: * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb. * Formatting and function docs fixes v2 (RFC/PATCH): * tcp-Add-DPIFL-thin-stream-detection-mechanism: * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet. * tcp-Add-Redundant-Data-Bundling-RDB: * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell. * Cleaned up loss detection code in rdb_check_rtx_queue_loss v1 (RFC/PATCH) Bendik Rønning Opstad (2): tcp: Add DPIFL thin stream detection mechanism tcp: Add Redundant Data Bundling (RDB) Documentation/networking/ip-sysctl.txt | 23 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 35 + include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 35 + net/ipv4/tcp.c | 16 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 273 + 12 files changed, 399 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c -- 1.9.1
[PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss, which significantly reduces the application-layer latency in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to the lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1.
Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 273 + 12 files changed, 359 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index eb42853..14f960d 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 11f935c..eb81877 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2914,6 +2914,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index b386361..da6dae8 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1,
[PATCH v3 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 73b36d7..eb42853 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 3dd20fe..2d86bd7 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -271,6 +273,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1649,6 +1652,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT).
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 4d367b4..6014bc4 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -687,6 +688,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = &sysctl_tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = _d
Re: [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 23/11/15 18:43, Eric Dumazet wrote: > On Mon, 2015-11-23 at 17:26 +0100, Bendik Rønning Opstad wrote: > >> > + >> > +tcp_rdb_max_skbs - INTEGER >> > + Enable restriction on how many previous SKBs in the output queue >> > + RDB may include data from. A value of 1 will restrict bundling to >> > + only the data from the last packet that was sent. >> > + Default: 1 >> > + > skb is an internal thing. I would rather not expose a sysctl with such > name. > > Can be multi segment or not (if GSO/TSO is enabled) > > So even '1' skb can have very different content, from 1 byte to ~64 KB I see your point about not exposing the internal naming. What about tcp_rdb_max_packets? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC v2 net-next 0/2] tcp: Redundant Data Bundling (RDB)
This is a request for comments. Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, RDB can be kept from affecting competing traffic unfairly. Note: The current patch set depends on a recently submitted patch for tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674) These patches have been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2]. [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf Changes: v2: * tcp-Add-DPIFL-thin-stream-detection-mechanism: * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
* tcp-Add-Redundant-Data-Bundling-RDB: * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell. * Cleaned up loss detection code in rdb_check_rtx_queue_loss Bendik Rønning Opstad (2): tcp: Add DPIFL thin stream detection mechanism tcp: Add Redundant Data Bundling (RDB) Documentation/networking/ip-sysctl.txt | 23 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 35 + include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 35 + net/ipv4/tcp.c | 16 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 271 + 12 files changed, 397 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1. 
Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 271 + 12 files changed, 357 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 938ae73..1077de1 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_skbs - INTEGER + Enable restriction on how many previous SKBs in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c9c394b..1639cc0 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2806,6 +2806,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index b386361..da6dae8 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1, frt
[PATCH RFC v2 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 2ea4c45..938ae73 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 4fc457b..deac96f 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */ #define TCP_INIT_CWND 10 @@ -274,6 +276,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1631,6 +1634,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT).
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index a0bd7a5..5b12446 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -709,6 +710,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = _tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = _d
Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 24/10/15 14:57, Eric Dumazet wrote: > Thank you for this very high quality patch submission. > > Please give us a few days for proper evaluation. > > Thanks ! Guys, thank you very much for taking the time to evaluate this. Since there hasn't been any further feedback or comments, I'll submit an RFCv2 with a few changes, which include removing the Nagle modification. After discussing the Nagle change on setsockopt we realized that it should be evaluated more thoroughly, and is better left for a later patch submission. Bendik P.S. Trimming the CC list to only those who have responded as gmail says I'm spamming :-)
Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Monday, November 02, 2015 09:37:54 AM David Laight wrote: > From: Bendik Rønning Opstad > > Sent: 23 October 2015 21:50 > > RDB is a mechanism that enables a TCP sender to bundle redundant > > (already sent) data with TCP packets containing new data. By bundling > > (retransmitting) already sent data with each TCP packet containing new > > data, the connection will be more resistant to sporadic packet loss > > which reduces the application layer latency significantly in congested > > scenarios. > > What sort of traffic flows do you expect this to help? As mentioned in the cover letter, RDB is aimed at reducing the latencies for "thin-stream" traffic often produced by latency-sensitive applications. This blog post describes RDB and the underlying motivation: http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp Further information is available in the links referred to in the blog post. > An ssh (or similar) connection will get additional data to send, > but that sort of data flow needs Nagle in order to reduce the > number of packets sent. Whether an application needs to reduce the number of packets sent depends on the perspective of who you ask. If low latency is of high priority for the application it may need to increase the number of packets sent by disabling Nagle to reduce the segments sojourn times on the sender side. As for SSH clients, it seems OpenSSH disables Nagle for interactive sessions. > OTOH it might benefit from including unacked data if the Nagle > timer expires. > Being able to set the Nagle timer on a per-connection basis > (or maybe using something based on the RTT instead of 2 secs) > might make packet loss less problematic. There is no timer for Nagle? The current (Minshall variant) implementation restricts sending a small segment as long as the previously transmitted packet was small and is not yet ACKed. 
> Data flows that already have Nagle disabled (probably anything that > isn't command-response and isn't unidirectional bulk data) are > likely to generate a lot of packets within the RTT. How many packets such applications need to transmit for optimal latency varies to a great extent. Packets per RTT is not a very useful metric in this regard, considering the strict dependency on the RTT. This is why we propose a dynamic packets in flight limit (DPIFL) that indirectly relies on the application write frequency, i.e. how often the application performs write system calls. This limit is used to ensure that only applications that write data less frequently than a certain limit may utilize RDB. > Resending unacked data will just eat into available network bandwidth > and could easily make any congestion worse. > > I think that means you shouldn't resend data more than once, and/or > should make sure that the resent data isn't a significant overhead > on the packet being sent. It is important to remember what type of traffic flows we are discussing. The applications that RDB is aimed at helping produce application-limited flows that transmit small amounts of data, both in terms of payload per packet and packets per second. Analysis of traces from latency-sensitive applications producing traffic with thin-stream characteristics shows inter-transmission times ranging from a few ms (typically 20-30 ms on average) to many hundred ms. (http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp/#thin_streams) Increasing the amount of transmitted data will certainly contribute to congestion to some degree, but it is not (necessarily) an unreasonable trade-off considering the relatively small amounts of data such applications transmit compared to greedy flows. RDB does not cause more packets to be sent through the network, as it uses available "free" space in packets already scheduled for transmission.
With a bundling limitation of only one previous segment, the bandwidth requirement is doubled; accounting for headers, the increase is less. Even after doubling the BW requirement of an application that produces relatively little data, we still end up with a low BW requirement. The suggested minimum lower bound inter-transmission time is 10 ms, meaning that when an application writes data more frequently than every 10 ms (on average) it will not be allowed to utilize RDB. To what degree RDB affects competing traffic will of course depend on the link capacity and the number of simultaneous flows utilizing RDB. We have performed tests to assess how RDB affects competing traffic. In one of the test scenarios, 10 RDB-enabled thin streams and 10 regular TCP thin streams compete against 5 greedy TCP flows over a shared bottleneck limited to 5 Mbit/s. The results from this test show that by only bundling one previous segment with each packet (segment size: 120 bytes), the effect on the competing thin-stream traffic is modest. (http://mlab.no/blog/2015/10/redundant-d
Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Monday, October 26, 2015 02:58:03 PM Yuchung Cheng wrote: > On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund <apetl...@simula.no> wrote: > > > On 26 Oct 2015, at 15:50, Neal Cardwell <ncardw...@google.com> wrote: > > > > > > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad > > > > > > <bro.de...@gmail.com> wrote: > > >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, > > >> int level,> > > > > ... > > > > > >> + case TCP_RDB: > > >> + if (val < 0 || val > 1) { > > >> + err = -EINVAL; > > >> + } else { > > >> + tp->rdb = val; > > >> + tp->nonagle = val; > > > > > > The semantics of the tp->nonagle bits are already a bit complex. My > > > sense is that having a setsockopt of TCP_RDB transparently modify the > > > nagle behavior is going to add more extra complexity and unanticipated > > > behavior than is warranted given the slight possible gain in > > > convenience to the app writer. What about a model where the > > > application user just needs to remember to call > > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be > > > sensible? I see your nice tests at > > > > > > https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b > > > 7d8baf703b2c2ac1b> > > > > are already doing that. And my sense is that likewise most > > > well-engineered "thin stream" apps will already be using > > > setsockopt(TCP_NODELAY). Is that workable? This is definitely workable. I agree that it may not be an ideal solution to have TCP_RDB disable Nagle, however, it would be useful with a way to easily enable RDB and disable Nagle. > > We have been discussing this a bit back and forth. Your suggestion would > > be the right thing to keep the nagle semantics less complex and to > > educate developers in the intrinsics of the transport. > > > > We ended up choosing to implicitly disable nagle since it > > 1) is incompatible with the logic of RDB. 
> > 2) leaving it up to the developer to read the documentation and register > > the line saying that "failing to set TCP_NODELAY will void the RDB > > latency gain" will increase the chance of misconfigurations leading to > > deployment with no effect. > > > > The hope was to help both the well-engineered thin-stream apps and the > > ones deployed by developers with less detailed knowledge of the > > transport. > but would RDB be voided if this developer turns on RDB then turns on > Nagle later? It would (to a large degree), but I believe that's ok? The intention with also disabling Nagle is not to remove control from the application writer, so if TCP_RDB disables Nagle, they should not be prevented from explicitly enabling Nagle after enabling RDB. The idea is to make it as easy as possible for the application writer, and since Nagle is on by default, it makes sense to change this behavior when the application has indicated that it values low latencies. Would a solution with multiple option values to TCP_RDB be acceptable? E.g. 0 = Disable 1 = Enable RDB 2 = Enable RDB and disable Nagle If the sysctl tcp_rdb accepts the same values, setting the sysctl to 2 would allow to use and test RDB (with Nagle off) on applications that haven't explicitly disabled Nagle, which would make the sysctl tcp_rdb even more useful. Instead of having TCP_RDB modify Nagle, would it be better/acceptable to have a separate socket option (e.g. TCP_THIN/TCP_THIN_LOW_LATENCY) that enables RDB and disables Nagle? e.g. 0 = Use default system options? 1 = Enable RDB and disable Nagle This would separate the modification of Nagle from the TCP_RDB socket option and make it cleaner? 
Such an option could also enable other latency-reducing options like TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK: 2 = Enable RDB, TCP_THIN_LINEAR_TIMEOUTS, TCP_THIN_DUPACK, and disable Nagle Bendik
Re: [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
On Friday, October 23, 2015 02:44:14 PM Eric Dumazet wrote: > On Fri, 2015-10-23 at 22:50 +0200, Bendik Rønning Opstad wrote: > > > > > +/** > > + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on > > dynamic PIF > > + * limit > > + * @tp: the tcp_sock struct > > + * > > + * Return: true if current packets in flight (PIF) count is lower than > > + * the dynamic PIF limit, else false > > + */ > > +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) > > +{ > > + u64 dpif_lim = tp->srtt_us >> 3; > > + /* Div by is_thin_min_itt_lim, the minimum allowed ITT > > +* (Inter-transmission time) in usecs. > > +*/ > > + do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound); > > + return tcp_packets_in_flight(tp) < dpif_lim; > > +} > > + > This is very strange : > > You are using a do_div() while both operands are 32bits. A regular > divide would be ok : > > u32 dpif_lim = (tp->srtt_us >> 3) / tp->thin_dpifl_itt_lower_bound; > > But then, you can avoid the divide by using a multiply, less expensive : > > return(u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound > < > (tp->srtt_us >> 3); > You are of course correct. Will fix this and use multiply. Thanks.
[PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/linux/tcp.h| 6 ++ include/net/tcp.h | 20 net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 3 +++ 5 files changed, 46 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 85752c8..b841a76 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound for ITT (inter-transmission time) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). This threshold + is used to calculate a dynamic packets in flight limit (DPIFL) which + is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/tcp.h b/include/linux/tcp.h index c906f45..fc885db 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -269,6 +269,12 @@ struct tcp_sock { struct sk_buff* lost_skb_hint; struct sk_buff *retransmit_skb_hint; + /* The limit used to identify when a stream is thin based on a minimum +* allowed inter-transmission time (ITT) in microseconds. This is used +* to dynamically calculate a max packets in flight limit (DPIFL). + */ + int thin_dpifl_itt_lower_bound; + /* OOO segments go in this list. Note that socket lock must be held, * as we do not use sk_buff_head lock. */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 4fc457b..6534836 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* Minimum lower bound is 10 ms (10000 usec) */ /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */ #define TCP_INIT_CWND 10 @@ -274,6 +275,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1631,6 +1633,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{ + u64 dpif_lim = tp->srtt_us >> 3; + /* Div by is_thin_min_itt_lim, the minimum allowed ITT +* (Inter-transmission time) in usecs. +*/ + do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound); + return tcp_packets_in_flight(tp) < dpif_lim; +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 25300c5..917fdde 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = {
[PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
This is a request for comments. Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner. Note: Current patch set depends on a recently submitted patch for tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674) These patches have been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf Bendik Rønning Opstad (2): tcp: Add DPIFL thin stream detection mechanism tcp: Add Redundant Data Bundling (RDB) Documentation/networking/ip-sysctl.txt | 23 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 9 +- include/net/tcp.h | 34 include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 35 net/ipv4/tcp.c | 19 ++- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 281 + 12 files changed, 415 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c -- 1.9.1
[PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1. 
Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 +++ net/ipv4/tcp.c | 16 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 281 + 12 files changed, 369 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index b841a76..740e6a3 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_skbs - INTEGER + Enable restriction on how many previous SKBs in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 24f4dfd..3572d21 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2809,6 +2809,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index fc885db..f38b889 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1, frto: 1;/
[PATCH v2 net-next] tcp: Fix CWV being too strict on thin streams
Application limited streams such as thin streams, that transmit small
amounts of payload in relatively few packets per RTT, can be prevented
from growing the CWND when in congestion avoidance. This leads to
increased sojourn times for data segments in streams that often transmit
time-dependent data.

Currently, a connection is considered CWND limited only after having
successfully transmitted at least one packet with new data, while at the
same time failing to transmit some unsent data from the output queue
because the CWND is full. Applications that produce small amounts of
data may be left in a state where they are never considered to be CWND
limited, because all unsent data is successfully transmitted each time
an incoming ACK opens up for more data to be transmitted in the send
window.

Fix by always testing whether the CWND is fully used after successful
packet transmissions, such that a connection is considered CWND limited
whenever the CWND has been filled. This is the correct behavior as
specified in RFC 2861 (section 3.1).

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Cc: Mads Johannessen <mads...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
Changes in v2:
 - Instead of updating tp->is_cwnd_limited after failing to send any
   packets, test whether CWND is full after successful packet
   transmissions.
 - Updated commit message according to changes.

 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4cd0b50..57a586f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	/* Ok, it looks like it is advisable to defer.
	 */

-	if (cong_win < send_win && cong_win < skb->len)
+	if (cong_win < send_win && cong_win <= skb->len)
 		*is_cwnd_limited = true;

 	return true;
@@ -2060,7 +2060,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

 		cwnd_quota = tcp_cwnd_test(tp, skb);
 		if (!cwnd_quota) {
-			is_cwnd_limited = true;
 			if (push_one == 2)
 				/* Force out a loss probe pkt. */
 				cwnd_quota = 1;
@@ -2142,6 +2141,7 @@ repair:
 	/* Send one loss probe per tail loss episode. */
 	if (push_one != 2)
 		tcp_schedule_loss_probe(sk);
+	is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
 	tcp_cwnd_validate(sk, is_cwnd_limited);
 	return false;
 }
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
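The bookkeeping change above can be illustrated with a small user-space toy model (Python, not kernel code; the function and variable names are simplified stand-ins for tcp_write_xmit() and its state). It shows why the old logic never marks an application-limited thin stream as CWND limited when its unsent data always fits in the window, while the v2 check does:

```python
# Toy model of the tcp_write_xmit() bookkeeping discussed above.
# Not kernel code: packets are unit-sized and the queue is a counter.

def write_xmit(unsent, in_flight, cwnd, fixed=False):
    """Return (packets_sent, is_cwnd_limited) for one transmit pass."""
    is_cwnd_limited = False
    sent = 0
    while unsent > 0:
        cwnd_quota = cwnd - (in_flight + sent)
        if cwnd_quota <= 0:
            # Old code only learned it was CWND limited here, i.e. when
            # it still had unsent data but no quota left.
            is_cwnd_limited = True
            break
        unsent -= 1
        sent += 1
    if fixed:
        # v2 fix: always test whether the CWND is fully used after
        # transmitting, even if all queued data fit in the window.
        is_cwnd_limited |= (in_flight + sent) >= cwnd
    return sent, is_cwnd_limited

# Thin stream: cwnd=2, one packet in flight, one small segment queued.
# The segment fits, so the old logic reports "not CWND limited" even
# though the window is now completely full.
print(write_xmit(1, 1, 2, fixed=False))  # (1, False)
print(write_xmit(1, 1, 2, fixed=True))   # (1, True)
```

A bulk sender with more queued data than quota is flagged by both variants, which is why the problem only shows up for application-limited streams.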
Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
> Thanks for this report!

Thank you for answering!

> When you say "CWND is reduced due to loss", are you talking about RTO
> or Fast Recovery? Do you have any traces you can share that illustrate
> this issue?

The problem is not related to loss recovery, but only occurs in
congestion avoidance state. This is because tcp_is_cwnd_limited(),
called from the congestion control's ".cong_avoid" hook, will only use
tp->is_cwnd_limited when in congestion avoidance.

I've made two traces to compare the effect with and without the patch:
http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/

To ensure that the results are comparable, the traces are produced with
an extra patch applied (shown below) such that both connections are in
congestion avoidance from the beginning.

> Have you verified that this patch fixes the issue you identified? I
> think the fix may be in the wrong place for that scenario, or at least
> incomplete.

Yes, we have tested this and verified that the patch solves the issue
in the test scenarios we've run. As always (in networking) it can be
difficult to reproduce the issue in a consistent manner, so I will
describe the setup we have used.

In our testbed we have a sender host, a bridge and a receiver. The
bridge is configured with 75 ms delay in each direction, giving an RTT
of 150 ms.

On bridge:
tc qdisc add dev eth0 handle 1:0 root netem delay 75ms
tc qdisc add dev eth1 handle 1:0 root netem delay 75ms

We have used the tool streamzero
(https://bitbucket.org/bendikro/streamzero) to produce the thin-stream
traffic. An example where the sender opens a TCP connection and
performs write calls to the socket every 10 ms with 100 bytes per call
for 60 seconds:

On receiver host (10.0.0.22):
./streamzero_srv -p 5010 -A

On sender host:
./streamzero_client -s 10.0.0.22 -p 5010 -v3 -A4 -d 60 -I i:10,S:100

streamzero_client will by default disable Nagle (TCP_NODELAY=1).
Adding loss for a few seconds in netem causes the CWND and ssthresh to
eventually drop to a low value, e.g. 2.

On bridge:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay 75ms loss 4%

Removing loss after a few seconds:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay 75ms

From this point on, the CWND will not grow any further, as shown by the
print statement when applying this patch:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..947eb57 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2135,6 +2135,8 @@ repair:
 			break;
 	}

+	printk("CWND: %d, SSTHRESH: %d, PKTS_OUT: %d\n", tp->snd_cwnd, tp->snd_ssthresh, tp->packets_out);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;

The following patch avoids the need to induce loss by setting the
connection in congestion avoidance from the beginning:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a62e9c7..206b32f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5394,6 +5394,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)

 	tcp_init_metrics(sk);
+	tp->snd_cwnd = tp->snd_ssthresh = 2;
 	tcp_init_congestion_control(sk);

 	/* Prevent spurious tcp_cwnd_restart() on first data

The traces linked to above are run with this patch applied.

> In the scenario you describe, you say "all the buffered data in the
> output queue was sent within the current CWND", which means that there
> is no unsent data, so in tcp_write_xmit() the call to tcp_send_head()
> would return NULL, so we would not enter the while loop, and could not
> set is_cwnd_limited to true.

I'll describe two example scenarios in detail. In both scenarios we are
in congestion avoidance after experiencing loss. Nagle is disabled.

Scenario 1:
Current CWND is 2 and there is currently one packet in flight. The
output queue contains one SKB with unsent data (payload <= 1 * MSS).
In tcp_write_xmit(), tcp_send_head() will return the SKB with unsent
data, and tcp_cwnd_test() will set cwnd_quota to 1, which means that
is_cwnd_limited will remain false. After sending the packet, sent_pkts
will be set to 1, and the while loop will break because tcp_send_head()
returns NULL. tcp_cwnd_validate() will be called with
is_cwnd_limited=false.

When a new data chunk is put on the output queue, and tcp_write_xmit()
is called, tcp_send_head() will return the SKB with unsent data.
tcp_cwnd_test() will now set cwnd_quota to 0 because packets in flight
== CWND. is_cwnd_limited will correctly be set to true, however, since
no packets were sent, tcp_cwnd_validate() will never be called.

(Calling tcp_cwnd_validate() if sent_pkts == 0 will not solve the
problem in this scenario, as it will not enter the first if test in
tcp_cwnd_validate() where tp->is_cwnd_limited is updated.)

Scenario 2:
CWND is 2 and there is currently one packet in flight. The output
[PATCH net-next] tcp: Fix CWV being too strict on thin streams
Application limited streams such as thin streams, that transmit small
amounts of payload in relatively few packets per RTT, are prevented
from growing the CWND after experiencing loss. This leads to increased
sojourn times for data segments in streams that often transmit
time-dependent data.

After the CWND is reduced due to loss, and an ACK has made room in the
send window for more packets to be transmitted, the CWND will not grow
unless there is more unsent data buffered in the output queue than the
CWND permits to be sent. That is because tcp_cwnd_validate(), which
updates tp->is_cwnd_limited, is only called in tcp_write_xmit() when at
least one packet with new data has been sent.

However, if all the buffered data in the output queue was sent within
the current CWND, is_cwnd_limited will remain false even when there is
no more room in the CWND. While the CWND is fully utilized, any new
data put on the output queue will be held back (i.e. limited by the
CWND), but tp->is_cwnd_limited will not be updated as no packets were
transmitted.

Fix by updating tp->is_cwnd_limited if no packets are sent due to the
CWND being fully utilized.

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
 net/ipv4/tcp_output.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..7096b8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,6 +2145,11 @@ repair:
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
 	}
+	/* We are prevented from transmitting because we are limited
+	 * by the CWND, so update tcp sock with correct value.
+	 */
+	if (is_cwnd_limited)
+		tp->is_cwnd_limited = is_cwnd_limited;

 	return !tp->packets_out && tcp_send_head(sk);
 }
--
1.9.1