Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-02-08 Thread Bendik Rønning Opstad
Eric, thank you for the feedback!

On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet  wrote:
> On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote:
>> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet 
wrote:
>>> Really this looks very complicated.
>>
>> Can you be more specific?
>
> A lot of code added, needing maintenance cost for years to come.

Yes, that is understandable.

>>> Why not simply append the new skb content to prior one ?
>>
>> It's not clear to me what you mean. At what stage in the output engine
>> do you refer to?
>>
>> We want to avoid modifying the data of the SKBs in the output queue,
>
> Why ? We already do that, as I pointed out.

I suspect that we might be talking past each other. It wasn't clear to
me that we were discussing how to implement this in a different way.

The current retrans collapse functionality only merges SKBs that
contain data that has already been sent and is about to be
retransmitted.

This differs significantly from RDB, which combines both already
transmitted data and unsent data in the same packet without changing
how the data is stored (and the state tracked) in the output queue.
Another difference is that RDB includes un-ACKed data that is not
considered lost.


>> therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
>> The header and payload of the first SKB containing data we want to
>> redundantly transmit is then copied. Then the payload of the SKBs following
>> next in the output queue is appended onto the rdb_skb. The last payload
>> that is appended is from the first SKB with unsent data, i.e. the
>> sk_send_head.
>>
>> Would you suggest a different approach?
>>
>>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
>>> really available (ie its clone not sitting/waiting in a qdisc on the
>>> host)
>>
>> Where do you suggest this should be used?
>
> To detect if appending data to prior skb is possible.

I see. As the implementation intentionally avoids modifying SKBs in
the output queue, this was not obvious.

> If the prior packet is still in qdisc, no change is allowed,
> and it is fine : DRB should not trigger anyway.

Actually, whether the data in the prior SKB is on the wire or is still
on the host (in qdisc/driver queue) is not relevant. RDB always wants
to redundantly resend the data if there is room in the packet, because
the previous packet may become lost.

>>> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
>>> MAX_TCP_HEADER) available bytes in skb->data.
>>
>> Sure, rdb_build_skb() could use this instead of the calculated
>> bytes_in_rdb_skb.
>
> Point is : small packets already have tail room in skb->head

Yes, I'm aware of that. But we do not allocate new SKBs because we
think the existing SKBs do not have enough space available. We do it
to avoid modifications to the SKBs in the output queue.

> When RDB decides a packet should be merged into the prior one, you can
> simply copy payload into the tailroom, then free the skb.
>
> No skb allocations are needed, only freeing.

It wasn't clear to me that you suggest a completely different
implementation approach altogether.

As I understand you, the approach you suggest is as follows:

1. An SKB containing unsent data is processed for transmission (lets
   call it T_SKB)
2. Check if the previous SKB (lets call it P_SKB) (containing sent but
   un-ACKed data) has available (tail) room for the payload contained
   in T_SKB.
3. If room in P_SKB:
  * Copy the unsent data from T_SKB to P_SKB by appending it to the
linear data and update sequence numbers.
  * Remove T_SKB (which contains only the new and unsent data) from
the output queue.
  * Transmit P_SKB, which now contains some already sent data and some
unsent data.


If I have misunderstood, can you please elaborate in detail what you
mean?

If this is the approach you suggest, I can think of some potential
downsides that require further considerations:


1) ACK-accounting will work differently

When the previous SKB (P_SKB) is modified by appending the data of the
next SKB (T_SKB), what should happen when an incoming ACK acknowledges
the data that was sent in the original transmission (before the SKB
was modified), but not the data that was appended later?
tcp_clean_rtx_queue currently handles partially ACKed SKBs due to TSO,
in which case the tcp_skb_pcount(skb) > 1. So this function would need
to be modified to handle this for RDB modified SKBs in the queue,
where all the data is located in the linear data buffer (no GSO segs).

How should SACK and retrans flags be handled when one SKB in the
output queue can represent multiple transmitted packets?


2) Timestamps and RTT measurements

How should RTT measurements work when you don't have a timestamp for
the data that was newly appended to the existing SKB containing sent
but un-ACKed data? Or should the skb->skb_mstamp be updated when the
SKB with newly appended data is sent again? That would make any RTT
meas

Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-02-08 Thread Bendik Rønning Opstad
Sorry guys, I messed up that email by including HTML, and it got
rejected by netdev@vger.kernel.org. I'll resend it properly formatted.

Bendik

On 08/02/16 18:17, Bendik Rønning Opstad wrote:
> Eric, thank you for the feedback!
> 
> On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet  wrote:
>> On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote:
>>> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet 
> wrote:
 Really this looks very complicated.
>>>
>>> Can you be more specific?
>>
>> A lot of code added, needing maintenance cost for years to come.
> 
> Yes, that is understandable.
> 
 Why not simply append the new skb content to prior one ?
>>>
>>> It's not clear to me what you mean. At what stage in the output engine
>>> do you refer to?
>>>
>>> We want to avoid modifying the data of the SKBs in the output queue,
>>
>> Why ? We already do that, as I pointed out.
> 
> I suspect that we might be talking past each other. It wasn't clear to
> me that we were discussing how to implement this in a different way.
> 
> The current retrans collapse functionality only merges SKBs that
> contain data that has already been sent and is about to be
> retransmitted.
> 
> This differs significantly from RDB, which combines both already
> transmitted data and unsent data in the same packet without changing
> how the data is stored (and the state tracked) in the output queue.
> Another difference is that RDB includes un-ACKed data that is not
> considered lost.
> 
>>> therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
>>> The header and payload of the first SKB containing data we want to
>>> redundantly transmit is then copied. Then the payload of the SKBs
> following
>>> next in the output queue is appended onto the rdb_skb. The last payload
>>> that is appended is from the first SKB with unsent data, i.e. the
>>> sk_send_head.
>>>
>>> Would you suggest a different approach?
>>>
 skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb
> is
 really available (ie its clone not sitting/waiting in a qdisc on the
 host)
>>>
>>> Where do you suggest this should be used?
>>
>> To detect if appending data to prior skb is possible.
> 
> I see. As the implementation intentionally avoids modifying SKBs in
> the output queue, this was not obvious.
> 
>> If the prior packet is still in qdisc, no change is allowed,
>> and it is fine : DRB should not trigger anyway.
> 
> Actually, whether the data in the prior SKB is on the wire or is still
> on the host (in qdisc/driver queue) is not relevant. RDB always wants
> to redundantly resend the data if there is room in the packet, because
> the previous packet may become lost.
> 
 Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
 MAX_TCP_HEADER) available bytes in skb->data.
>>>
>>> Sure, rdb_build_skb() could use this instead of the calculated
>>> bytes_in_rdb_skb.
>>
>> Point is : small packets already have tail room in skb->head
> 
> Yes, I'm aware of that. But we do not allocate new SKBs because we
> think the existing SKBs do not have enough space available. We do it
> to avoid modifications to the SKBs in the output queue.
> 
>> When RDB decides a packet should be merged into the prior one, you can
>> simply copy payload into the tailroom, then free the skb.
>>
>> No skb allocations are needed, only freeing.
> 
> It wasn't clear to me that you suggest a completely different
> implementation approach altogether.
> 
> As I understand you, the approach you suggest is as follows:
> 
> 1. An SKB containing unsent data is processed for transmission (lets
>call it T_SKB)
> 2. Check if the previous SKB (lets call it P_SKB) (containing sent but
>un-ACKed data) has available (tail) room for the payload contained
>in T_SKB.
> 3. If room in P_SKB:
>   * Copy the unsent data from T_SKB to P_SKB by appending it to the
> linear data and update sequence numbers.
>   * Remove T_SKB (which contains only the new and unsent data) from
> the output queue.
>   * Transmit P_SKB, which now contains some already sent data and some
> unsent data.
> 
> 
> If I have misunderstood, can you please elaborate in detail what you
> mean?
> 
> If this is the approach you suggest, I can think of some potential
> downsides that require further considerations:
> 
> 
> 1) ACK-accounting will work differently
> 
> When the previous SKB (P_SKB) is modified by appending the data of the
> next SKB (T_SKB), what should happen when an incoming ACK
> acknowledges the data that was sent in the original transmission
> (before the SKB was modified), but not the data that was appended
> later? tcp_clean_rtx_queue currently handles partially ACKed SKBs due
> to TSO, in which case the tcp_skb_pcount(skb) > 1. So this function
> would need to be modified to handle this for RDB modified SKBs in the
> queue, where all the data is located in the linear data buffer (no GSO
> segs).
> 
> How should SACK and retrans flags be handled wh

Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-02-03 Thread Eric Dumazet
On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote:
> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet  wrote:
> > On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote:
> >>
> >>   o When packets are scheduled for transmission, RDB replaces the SKB to
> >> be sent with a modified SKB containing the redundant data of
> >> previously sent data segments from the TCP output queue.
> >
> > Really this looks very complicated.
> 
> Can you be more specific?

A lot of code added, needing maintenance cost for years to come.

> 
> > Why not simply append the new skb content to prior one ?
> 
> It's not clear to me what you mean. At what stage in the output engine
> do you refer to?
> 
> We want to avoid modifying the data of the SKBs in the output queue,

Why ? We already do that, as I pointed out.

> therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
> The header and payload of the first SKB containing data we want to
> redundantly transmit is then copied. Then the payload of the SKBs following
> next in the output queue is appended onto the rdb_skb. The last payload
> that is appended is from the first SKB with unsent data, i.e. the
> sk_send_head.
> 
> Would you suggest a different approach?
> 
> > skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
> > really available (ie its clone not sitting/waiting in a qdisc on the
> > host)
> 
> Where do you suggest this should be used?

To detect if appending data to prior skb is possible.

If the prior packet is still in qdisc, no change is allowed,
and it is fine : DRB should not trigger anyway.

> 
> > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
> > MAX_TCP_HEADER) available bytes in skb->data.
> 
> Sure, rdb_build_skb() could use this instead of the calculated
> bytes_in_rdb_skb.

Point is : small packets already have tail room in skb->head

When RDB decides a packet should be merged into the prior one, you can
simply copy payload into the tailroom, then free the skb.

No skb allocations are needed, only freeing.

RDB could be implemented in a more concise way.





Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-02-03 Thread Bendik Rønning Opstad
On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet  wrote:
> On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote:
>>
>>   o When packets are scheduled for transmission, RDB replaces the SKB to
>> be sent with a modified SKB containing the redundant data of
>> previously sent data segments from the TCP output queue.
>
> Really this looks very complicated.

Can you be more specific?

> Why not simply append the new skb content to prior one ?

It's not clear to me what you mean. At what stage in the output engine
do you refer to?

We want to avoid modifying the data of the SKBs in the output queue,
therefore we allocate a new SKB (This SKB is named rdb_skb in the code).
The header and payload of the first SKB containing data we want to
redundantly transmit is then copied. Then the payload of the SKBs following
next in the output queue is appended onto the rdb_skb. The last payload
that is appended is from the first SKB with unsent data, i.e. the
sk_send_head.

Would you suggest a different approach?

> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
> really available (ie its clone not sitting/waiting in a qdisc on the
> host)

Where do you suggest this should be used?

> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
> MAX_TCP_HEADER) available bytes in skb->data.

Sure, rdb_build_skb() could use this instead of the calculated
bytes_in_rdb_skb.

> Also note that tcp_collapse_retrans() is very similar to your needs. You
> might simply expand it.

The functionality shared is the copying of data from one SKB to another, as
well as adjusting sequence numbers and checksum. Unlinking SKBs from the
output queue, modifying the data of SKBs in the output queue, and changing
retrans hints is not shared.

To reduce code duplication, the function skb_append_data in tcp_rdb.c could
be moved to tcp_output.c, and then be called from tcp_collapse_retrans.

Is it something like this you had in mind?


Bendik


Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-02-02 Thread Eric Dumazet
On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote:
> RDB is a mechanism that enables a TCP sender to bundle redundant
> (already sent) data with TCP packets containing new data. By bundling
> (retransmitting) already sent data with each TCP packet containing new
> data, the connection will be more resistant to sporadic packet loss
> which reduces the application layer latency significantly in congested
> scenarios.
> 
> The main functionality added:
> 
>   o Loss detection of hidden loss events: When bundling redundant data
> with each packet, packet loss can be hidden from the TCP engine due
> to lack of dupACKs. This is because the loss is "repaired" by the
> redundant data in the packet coming after the lost packet. Based on
> incoming ACKs, such hidden loss events are detected, and CWR state
> is entered.
> 
>   o When packets are scheduled for transmission, RDB replaces the SKB to
> be sent with a modified SKB containing the redundant data of
> previously sent data segments from the TCP output queue.

Really this looks very complicated.

Why not simply append the new skb content to prior one ?

skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is
really available (ie its clone not sitting/waiting in a qdisc on the
host)

Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 -
MAX_TCP_HEADER) available bytes in skb->data.

Also note that tcp_collapse_retrans() is very similar to your needs. You
might simply expand it.





[PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-02-02 Thread Bendik Rønning Opstad
RDB is a mechanism that enables a TCP sender to bundle redundant
(already sent) data with TCP packets containing new data. By bundling
(retransmitting) already sent data with each TCP packet containing new
data, the connection will be more resistant to sporadic packet loss
which reduces the application layer latency significantly in congested
scenarios.

The main functionality added:

  o Loss detection of hidden loss events: When bundling redundant data
with each packet, packet loss can be hidden from the TCP engine due
to lack of dupACKs. This is because the loss is "repaired" by the
redundant data in the packet coming after the lost packet. Based on
incoming ACKs, such hidden loss events are detected, and CWR state
is entered.

  o When packets are scheduled for transmission, RDB replaces the SKB to
be sent with a modified SKB containing the redundant data of
previously sent data segments from the TCP output queue.

  o RDB will only be used for streams classified as thin by the function
tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT
for streams that may benefit from RDB, controlled by the sysctl
variable tcp_thin_dpifl_itt_lower_bound.

RDB is enabled on a connection with the socket option TCP_RDB, or on all
new connections by setting the sysctl variable tcp_rdb=1.

Cc: Andreas Petlund 
Cc: Carsten Griwodz 
Cc: Pål Halvorsen 
Cc: Jonas Markussen 
Cc: Kristian Evensen 
Cc: Kenneth Klette Jonassen 
Signed-off-by: Bendik Rønning Opstad 
---
 Documentation/networking/ip-sysctl.txt |  15 ++
 include/linux/skbuff.h |   1 +
 include/linux/tcp.h|   3 +-
 include/net/tcp.h  |  14 ++
 include/uapi/linux/tcp.h   |   1 +
 net/core/skbuff.c  |   3 +-
 net/ipv4/Makefile  |   3 +-
 net/ipv4/sysctl_net_ipv4.c |  26 
 net/ipv4/tcp.c |  14 +-
 net/ipv4/tcp_input.c   |   3 +
 net/ipv4/tcp_output.c  |  11 +-
 net/ipv4/tcp_rdb.c | 273 +
 12 files changed, 359 insertions(+), 8 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index eb42853..14f960d 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
calculated, which is used to classify whether a stream is thin.
Default: 1
 
+tcp_rdb - BOOLEAN
+   Enable RDB for all new TCP connections.
+   Default: 0
+
+tcp_rdb_max_bytes - INTEGER
+   Enable restriction on how many bytes an RDB packet can contain.
+   This is the total amount of payload including the new unsent data.
+   Default: 0
+
+tcp_rdb_max_packets - INTEGER
+   Enable restriction on how many previous packets in the output queue
+   RDB may include data from. A value of 1 will restrict bundling to
+   only the data from the last packet that was sent.
+   Default: 1
+
 tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11f935c..eb81877 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2914,6 +2914,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct 
iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int 
flags);
+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b386361..da6dae8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -202,9 +202,10 @@ struct tcp_sock {
} rack;
u16 advmss; /* Advertised MSS   */
u8  unused;
-   u8  nonagle : 4,/* Disable Nagle algorithm? */
+   u8  nonagle : 3,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
thin_dupack : 1,/* Fast retransmit on first dupack  */
+   rdb : 1,/* Redundant Data Bundling enabled  */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2d86bd7..bc6e81b 100644
--- a/include/net/tcp.h
+++ b/inclu