Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Eric, thank you for the feedback! On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet wrote: > On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote: >> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet wrote: >>> Really this looks very complicated. >> >> Can you be more specific? > > A lot of code added, needing maintenance cost for years to come. Yes, that is understandable. >>> Why not simply append the new skb content to prior one ? >> >> It's not clear to me what you mean. At what stage in the output engine >> do you refer to? >> >> We want to avoid modifying the data of the SKBs in the output queue, > > Why ? We already do that, as I pointed out. I suspect that we might be talking past each other. It wasn't clear to me that we were discussing how to implement this in a different way. The current retrans collapse functionality only merges SKBs that contain data that has already been sent and is about to be retransmitted. This differs significantly from RDB, which combines both already transmitted data and unsent data in the same packet without changing how the data is stored (and the state tracked) in the output queue. Another difference is that RDB includes un-ACKed data that is not considered lost. >> therefore we allocate a new SKB (This SKB is named rdb_skb in the code). >> The header and payload of the first SKB containing data we want to >> redundantly transmit is then copied. Then the payload of the SKBs following >> next in the output queue is appended onto the rdb_skb. The last payload >> that is appended is from the first SKB with unsent data, i.e. the >> sk_send_head. >> >> Would you suggest a different approach? >> >>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is >>> really available (ie its clone not sitting/waiting in a qdisc on the >>> host) >> >> Where do you suggest this should be used? > > To detect if appending data to prior skb is possible. I see. As the implementation intentionally avoids modifying SKBs in the output queue, this was not obvious. > If the prior packet is still in qdisc, no change is allowed, > and it is fine : DRB should not trigger anyway. Actually, whether the data in the prior SKB is on the wire or is still on the host (in qdisc/driver queue) is not relevant. RDB always wants to redundantly resend the data if there is room in the packet, because the previous packet may become lost. >>> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - >>> MAX_TCP_HEADER) available bytes in skb->data. >> >> Sure, rdb_build_skb() could use this instead of the calculated >> bytes_in_rdb_skb. > > Point is : small packets already have tail room in skb->head Yes, I'm aware of that. But we do not allocate new SKBs because we think the existing SKBs do not have enough space available. We do it to avoid modifications to the SKBs in the output queue. > When RDB decides a packet should be merged into the prior one, you can > simply copy payload into the tailroom, then free the skb. > > No skb allocations are needed, only freeing. It wasn't clear to me that you suggest a completely different implementation approach altogether. As I understand you, the approach you suggest is as follows: 1. An SKB containing unsent data is processed for transmission (lets call it T_SKB) 2. Check if the previous SKB (lets call it P_SKB) (containing sent but un-ACKed data) has available (tail) room for the payload contained in T_SKB. 3. If room in P_SKB: * Copy the unsent data from T_SKB to P_SKB by appending it to the linear data and update sequence numbers. * Remove T_SKB (which contains only the new and unsent data) from the output queue. * Transmit P_SKB, which now contains some already sent data and some unsent data. If I have misunderstood, can you please elaborate in detail what you mean? If this is the approach you suggest, I can think of some potential downsides that require further considerations: 1) ACK-accounting will work differently When the previous SKB (P_SKB) is modified by appending the data of the next SKB (T_SKB), what should happen when an incoming ACK acknowledges the data that was sent in the original transmission (before the SKB was modified), but not the data that was appended later? tcp_clean_rtx_queue currently handles partially ACKed SKBs due to TSO, in which case the tcp_skb_pcount(skb) > 1. So this function would need to be modified to handle this for RDB modified SKBs in the queue, where all the data is located in the linear data buffer (no GSO segs). How should SACK and retrans flags be handled when one SKB in the output queue can represent multiple transmitted packets? 2) Timestamps and RTT measurements How should RTT measurements work when you don't have a timestamp for the data that was newly appended to the existing SKB containing sent but un-ACKed data? Or should the skb->skb_mstamp be updated when the SKB with newly appended data is sent again? That would make any RTT meas
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Sorry guys, I messed up that email by including HTML, and it got rejected by netdev@vger.kernel.org. I'll resend it properly formatted. Bendik On 08/02/16 18:17, Bendik Rønning Opstad wrote: > Eric, thank you for the feedback! > > On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet wrote: >> On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote: >>> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet > wrote: Really this looks very complicated. >>> >>> Can you be more specific? >> >> A lot of code added, needing maintenance cost for years to come. > > Yes, that is understandable. > Why not simply append the new skb content to prior one ? >>> >>> It's not clear to me what you mean. At what stage in the output engine >>> do you refer to? >>> >>> We want to avoid modifying the data of the SKBs in the output queue, >> >> Why ? We already do that, as I pointed out. > > I suspect that we might be talking past each other. It wasn't clear to > me that we were discussing how to implement this in a different way. > > The current retrans collapse functionality only merges SKBs that > contain data that has already been sent and is about to be > retransmitted. > > This differs significantly from RDB, which combines both already > transmitted data and unsent data in the same packet without changing > how the data is stored (and the state tracked) in the output queue. > Another difference is that RDB includes un-ACKed data that is not > considered lost. > >>> therefore we allocate a new SKB (This SKB is named rdb_skb in the code). >>> The header and payload of the first SKB containing data we want to >>> redundantly transmit is then copied. Then the payload of the SKBs > following >>> next in the output queue is appended onto the rdb_skb. The last payload >>> that is appended is from the first SKB with unsent data, i.e. the >>> sk_send_head. >>> >>> Would you suggest a different approach? >>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb > is really available (ie its clone not sitting/waiting in a qdisc on the host) >>> >>> Where do you suggest this should be used? >> >> To detect if appending data to prior skb is possible. > > I see. As the implementation intentionally avoids modifying SKBs in > the output queue, this was not obvious. > >> If the prior packet is still in qdisc, no change is allowed, >> and it is fine : DRB should not trigger anyway. > > Actually, whether the data in the prior SKB is on the wire or is still > on the host (in qdisc/driver queue) is not relevant. RDB always wants > to redundantly resend the data if there is room in the packet, because > the previous packet may become lost. > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER) available bytes in skb->data. >>> >>> Sure, rdb_build_skb() could use this instead of the calculated >>> bytes_in_rdb_skb. >> >> Point is : small packets already have tail room in skb->head > > Yes, I'm aware of that. But we do not allocate new SKBs because we > think the existing SKBs do not have enough space available. We do it > to avoid modifications to the SKBs in the output queue. > >> When RDB decides a packet should be merged into the prior one, you can >> simply copy payload into the tailroom, then free the skb. >> >> No skb allocations are needed, only freeing. > > It wasn't clear to me that you suggest a completely different > implementation approach altogether. > > As I understand you, the approach you suggest is as follows: > > 1. An SKB containing unsent data is processed for transmission (lets >call it T_SKB) > 2. Check if the previous SKB (lets call it P_SKB) (containing sent but >un-ACKed data) has available (tail) room for the payload contained >in T_SKB. > 3. If room in P_SKB: > * Copy the unsent data from T_SKB to P_SKB by appending it to the > linear data and update sequence numbers. > * Remove T_SKB (which contains only the new and unsent data) from > the output queue. > * Transmit P_SKB, which now contains some already sent data and some > unsent data. > > > If I have misunderstood, can you please elaborate in detail what you > mean? > > If this is the approach you suggest, I can think of some potential > downsides that require further considerations: > > > 1) ACK-accounting will work differently > > When the previous SKB (P_SKB) is modified by appending the data of the > next SKB (T_SKB), what should happen when an incoming ACK > acknowledges the data that was sent in the original transmission > (before the SKB was modified), but not the data that was appended > later? tcp_clean_rtx_queue currently handles partially ACKed SKBs due > to TSO, in which case the tcp_skb_pcount(skb) > 1. So this function > would need to be modified to handle this for RDB modified SKBs in the > queue, where all the data is located in the linear data buffer (no GSO > segs). > > How should SACK and retrans flags be handled wh
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote: > On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet wrote: > > On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote: > >> > >> o When packets are scheduled for transmission, RDB replaces the SKB to > >> be sent with a modified SKB containing the redundant data of > >> previously sent data segments from the TCP output queue. > > > > Really this looks very complicated. > > Can you be more specific? A lot of code added, needing maintenance cost for years to come. > > > Why not simply append the new skb content to prior one ? > > It's not clear to me what you mean. At what stage in the output engine > do you refer to? > > We want to avoid modifying the data of the SKBs in the output queue, Why ? We already do that, as I pointed out. > therefore we allocate a new SKB (This SKB is named rdb_skb in the code). > The header and payload of the first SKB containing data we want to > redundantly transmit is then copied. Then the payload of the SKBs following > next in the output queue is appended onto the rdb_skb. The last payload > that is appended is from the first SKB with unsent data, i.e. the > sk_send_head. > > Would you suggest a different approach? > > > skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is > > really available (ie its clone not sitting/waiting in a qdisc on the > > host) > > Where do you suggest this should be used? To detect if appending data to prior skb is possible. If the prior packet is still in qdisc, no change is allowed, and it is fine : DRB should not trigger anyway. > > > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - > > MAX_TCP_HEADER) available bytes in skb->data. > > Sure, rdb_build_skb() could use this instead of the calculated > bytes_in_rdb_skb. Point is : small packets already have tail room in skb->head When RDB decides a packet should be merged into the prior one, you can simply copy payload into the tailroom, then free the skb. No skb allocations are needed, only freeing. RDB could be implemented in a more concise way.
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet wrote: > On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote: >> >> o When packets are scheduled for transmission, RDB replaces the SKB to >> be sent with a modified SKB containing the redundant data of >> previously sent data segments from the TCP output queue. > > Really this looks very complicated. Can you be more specific? > Why not simply append the new skb content to prior one ? It's not clear to me what you mean. At what stage in the output engine do you refer to? We want to avoid modifying the data of the SKBs in the output queue, therefore we allocate a new SKB (This SKB is named rdb_skb in the code). The header and payload of the first SKB containing data we want to redundantly transmit is then copied. Then the payload of the SKBs following next in the output queue is appended onto the rdb_skb. The last payload that is appended is from the first SKB with unsent data, i.e. the sk_send_head. Would you suggest a different approach? > skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is > really available (ie its clone not sitting/waiting in a qdisc on the > host) Where do you suggest this should be used? > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - > MAX_TCP_HEADER) available bytes in skb->data. Sure, rdb_build_skb() could use this instead of the calculated bytes_in_rdb_skb. > Also note that tcp_collapse_retrans() is very similar to your needs. You > might simply expand it. The functionality shared is the copying of data from one SKB to another, as well as adjusting sequence numbers and checksum. Unlinking SKBs from the output queue, modifying the data of SKBs in the output queue, and changing retrans hints is not shared. To reduce code duplication, the function skb_append_data in tcp_rdb.c could be moved to tcp_output.c, and then be called from tcp_collapse_retrans. Is it something like this you had in mind? Bendik
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote: > RDB is a mechanism that enables a TCP sender to bundle redundant > (already sent) data with TCP packets containing new data. By bundling > (retransmitting) already sent data with each TCP packet containing new > data, the connection will be more resistant to sporadic packet loss > which reduces the application layer latency significantly in congested > scenarios. > > The main functionality added: > > o Loss detection of hidden loss events: When bundling redundant data > with each packet, packet loss can be hidden from the TCP engine due > to lack of dupACKs. This is because the loss is "repaired" by the > redundant data in the packet coming after the lost packet. Based on > incoming ACKs, such hidden loss events are detected, and CWR state > is entered. > > o When packets are scheduled for transmission, RDB replaces the SKB to > be sent with a modified SKB containing the redundant data of > previously sent data segments from the TCP output queue. Really this looks very complicated. Why not simply append the new skb content to prior one ? skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is really available (ie its clone not sitting/waiting in a qdisc on the host) Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER) available bytes in skb->data. Also note that tcp_collapse_retrans() is very similar to your needs. You might simply expand it.
[PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1. Cc: Andreas Petlund Cc: Carsten Griwodz Cc: Pål Halvorsen Cc: Jonas Markussen Cc: Kristian Evensen Cc: Kenneth Klette Jonassen Signed-off-by: Bendik Rønning Opstad --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 273 + 12 files changed, 359 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index eb42853..14f960d 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 11f935c..eb81877 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2914,6 +2914,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index b386361..da6dae8 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1, frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */ u8 repair_queue; diff --git a/include/net/tcp.h b/include/net/tcp.h index 2d86bd7..bc6e81b 100644 --- a/include/net/tcp.h +++ b/inclu