[PATCH v7 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games and
remote desktop, produce traffic with thin-stream characteristics,
characterized by small packets and a relatively high inter-transmission
time (ITT). By bundling already sent data in packets with new data, RDB
alleviates head-of-line blocking by reducing the need to retransmit
data segments when packets are lost.

RDB is a continuation of the work on latency improvements for TCP in
Linux, previously resulting in two thin-stream mechanisms in the Linux
kernel
(https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows
significant latency reductions when packet loss occurs [1]. The tests
show that, by imposing restrictions on the bundling rate, it can be
made not to negatively affect competing traffic in an unfair manner.

These patches have also been tested with a set of packetdrill scripts
located at
https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb
(The tests require patching packetdrill with a new socket option:
https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at
http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as
in the paper "Latency and Fairness Trade-Off for Thin Streams using
Redundant Data Bundling in TCP" [2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v7 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed sysctl_tcp_rdb to accept three values (Thanks Yuchung):
     - 0: Disable system wide (RDB cannot be enabled with TCP_RDB
          socket option)
     - 1: Allow enabling RDB with TCP_RDB socket option
     - 2: Enable RDB by default on all TCP sockets and allow modifying
          it with TCP_RDB
   * Added sysctl tcp_rdb_wait_congestion to control whether RDB by
     default should wait for congestion before bundling. (Ref. comment
     by Eric on lossless conns)
   * Changed socket options to modify per-socket RDB settings:
     - Added flags to TCP_RDB to allow bundling without waiting for
       loss with TCP_RDB_BUNDLE_IMMEDIATE.
     - Added socket option TCP_RDB_MAX_BYTES: Set max bytes per RDB
       packet.
     - Added socket option TCP_RDB_MAX_PACKETS: Set max packets
       allowed to be bundled by RDB.
   * Added SNMP counter LINUX_MIB_TCPRDBLOSSREPAIRS to count the
     occurrences where RDB repaired a loss (Thanks Eric).
   * Bundle on FIN packets (Thanks Yuchung).
   * Updated docs in Documentation/networking/{ip-sysctl.txt,tcp-thin.txt}
   * Removed flags parameter from tcp_rdb_ack_event()
   * Changed sysctl knobs to use network namespace.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Changed sysctl knobs to use network namespace.

v6 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
   * Minor doc changes
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Minor doc changes

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in
     rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss() and
     restructured it to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function
     from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into
     tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to
     tcp_rdb_max_packets after comment from Eric Dumazet about not
     exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Changed calculation in tcp_stream_is_thin_dpifl() based on
     feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt() (TCP_RDB) to
     reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss()

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt | 43 ++
 Documentation/netw
[PATCH v7 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics, characterized by small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.

RDB enables a TCP sender to bundle redundant (already sent) data with
TCP packets containing small segments of new data. By resending
un-ACKed data from the output queue in packets with new data, RDB
reduces the need to retransmit data segments on connections
experiencing sporadic packet loss. By avoiding a retransmit, RDB
evades the latency increase of at least one RTT for the lost packet,
as well as alleviating head-of-line blocking for the packets following
the lost packet. This makes the TCP connection more resistant to
latency fluctuations, and reduces the application layer latency
significantly in lossy environments.

Main functionality added:

  o When a packet is scheduled for transmission, RDB builds and
    transmits a new SKB containing both the unsent data as well as
    data of previously sent packets from the TCP output queue.

  o RDB will only be used for streams classified as thin by the
    function tcp_stream_is_thin_dpifl(). This enforces a lower bound
    on the ITT for streams that may benefit from RDB, controlled by
    the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine
    due to lack of dupACKs. This is because the loss is "repaired" by
    the redundant data in the packet coming after the lost packet.
    Based on incoming ACKs, such hidden loss events are detected, and
    CWR state is entered.

RDB can be enabled on a connection with the socket option TCP_RDB, or
on all new connections by setting the sysctl variable
net.ipv4.tcp_rdb=2

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Pål Halvorsen <pa...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kristian Evensen <kristian.even...@gmail.com>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  35 +
 Documentation/networking/tcp-thin.txt  | 188 --
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |  11 +-
 include/net/netns/ipv4.h               |   5 +
 include/net/tcp.h                      |  12 ++
 include/uapi/linux/snmp.h              |   1 +
 include/uapi/linux/tcp.h               |  10 ++
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/proc.c                        |   1 +
 net/ipv4/sysctl_net_ipv4.c             |  34 +
 net/ipv4/tcp.c                         |  42 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_ipv4.c                    |   5 +
 net/ipv4/tcp_output.c                  |  49 ---
 net/ipv4/tcp_rdb.c                     | 240 +
 17 files changed, 579 insertions(+), 63 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d856b98..d26d12b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -726,6 +726,41 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
 	calculated, which is used to classify whether a stream is thin.
 	Default: 10000
 
+tcp_rdb - INTEGER
+	Controls the use of the Redundant Data Bundling (RDB) mechanism
+	for TCP connections.
+
+	RDB is a TCP mechanism aimed at reducing the latency for
+	applications transmitting time-dependent data. By bundling
+	already sent data in packets with new data, RDB alleviates
+	head-of-line blocking on the receiver side by reducing the need
+	to retransmit data segments when packets are lost. See
+	tcp-thin.txt for further details.
+
+	Possible values:
+	0 - Disable RDB system wide, i.e. disallow enabling RDB on TCP
+	    sockets with the socket option TCP_RDB.
+	1 - Allow enabling/disabling RDB with socket option TCP_RDB.
+	2 - Set RDB to be enabled by default for all new TCP connections
+	    and allow modifying the socket with socket option TCP_RDB.
+
+	Default: 1
+
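As a rough illustration of the per-socket knob described above, the
following sketch attempts to enable RDB with setsockopt(). Note that
TCP_RDB is not a mainline socket option; the option number used here
(9999) is a deliberately fake placeholder, so on an unpatched kernel
the call fails and the helper reports that RDB is unsupported.

```python
import socket

# Hypothetical placeholder; the real TCP_RDB value would come from the
# patched kernel's <linux/tcp.h>. 9999 is intentionally invalid.
TCP_RDB = 9999

def try_enable_rdb(sock):
    """Attempt to enable RDB on a TCP socket.

    Returns True if the kernel accepted the option, False otherwise
    (e.g. ENOPROTOOPT on kernels without the RDB patch set).
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_RDB, 1)
        return True
    except OSError:
        return False

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
supported = try_enable_rdb(s)
s.close()
print("RDB supported:", supported)
```

The system-wide behavior would instead be selected with the
net.ipv4.tcp_rdb sysctl (0/1/2) described in the documentation above.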
[PATCH v7 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams,
tcp_stream_is_thin(), is based on a static limit of less than 4
packets in flight. This treats streams differently depending on the
connection's RTT, such that a stream on a high RTT link may never be
considered thin, whereas the same application would produce a stream
that would always be thin in a low RTT scenario (e.g. data center).

By calculating a dynamic packets in flight limit (DPIFL), the thin
stream detection will be independent of the RTT and treat streams
equally based on the transmission pattern, i.e. the inter-transmission
time (ITT).

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Pål Halvorsen <pa...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kristian Evensen <kristian.even...@gmail.com>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/netns/ipv4.h               |  1 +
 include/net/tcp.h                      | 21 +++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 40 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9ae9293..d856b98 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -718,6 +718,14 @@ tcp_thin_dupack - BOOLEAN
 	Documentation/networking/tcp-thin.txt
 	Default: 0
 
+tcp_thin_dpifl_itt_lower_bound - INTEGER
+	Controls the lower bound inter-transmission time (ITT) threshold
+	for when a stream is considered thin. The value is specified in
+	microseconds, and may not be lower than 10000 (10 ms). Based on
+	this threshold, a dynamic packets in flight limit (DPIFL) is
+	calculated, which is used to classify whether a stream is thin.
+	Default: 10000
+
 tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d061ffe..71be4ac 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -111,6 +111,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_orphan_retries;
 	int sysctl_tcp_fin_timeout;
 	unsigned int sysctl_tcp_notsent_lowat;
+	int sysctl_tcp_thin_dpifl_itt_lower_bound;
 
 	int sysctl_igmp_max_memberships;
 	int sysctl_igmp_max_msf;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a79894b..9956af9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -214,6 +214,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* TCP thin-stream limits */
 #define TCP_THIN_LINEAR_RETRIES 6	/* After 6 linear retries, do exp. backoff */
+/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */
+#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000
 
 /* TCP initial congestion window as per rfc6928 */
 #define TCP_INIT_CWND		10
@@ -1652,6 +1654,25 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
 
+/**
+ * tcp_stream_is_thin_dpifl() - Test if the stream is thin based on
+ *                              dynamic PIF limit (DPIFL)
+ * @sk: socket
+ *
+ * Return: true if current packets in flight (PIF) count is lower than
+ *         the dynamic PIF limit, else false
+ */
+static inline bool tcp_stream_is_thin_dpifl(const struct sock *sk)
+{
+	/* Calculate the maximum allowed PIF limit by dividing the RTT by
+	 * the minimum allowed inter-transmission time (ITT).
+	 * Tests if PIF < RTT / ITT-lower-bound
+	 */
+	return (u64) tcp_packets_in_flight(tcp_sk(sk)) *
+	       sock_net(sk)->ipv4.sysctl_tcp_thin_dpifl_itt_lower_bound <
+	       (tcp_sk(sk)->srtt_us >> 3);
+}
+
 /* /proc */
 enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cb67de..150969d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
 
 /* Update system visible IP port range */
 static void set_local_port_range(struct net *net, int range[2])
@@ -960,6 +961,14 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_thin_dpifl_itt_lower_bound"
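The DPIFL test in tcp_stream_is_thin_dpifl() above can be sketched in
plain Python to show the arithmetic. The kernel avoids a division by
rearranging PIF < RTT / ITT-lower-bound into PIF * ITT-lower-bound <
RTT (srtt_us is stored scaled by 8 in the kernel, hence the >> 3 there;
this sketch takes an unscaled srtt in microseconds).

```python
def is_thin_dpifl(packets_in_flight, srtt_us, itt_lower_bound_us=10000):
    """A stream is thin while its packets-in-flight count is below the
    dynamic limit srtt / itt_lower_bound (division avoided by
    cross-multiplying, as in the kernel code)."""
    return packets_in_flight * itt_lower_bound_us < srtt_us

# A stream writing every ~100 ms on a 200 ms RTT link has ~2 PIF:
print(is_thin_dpifl(2, 200000))   # 2 * 10000 < 200000 -> thin
# A bulk sender with 50 PIF on the same link is not thin:
print(is_thin_dpifl(50, 200000))  # 50 * 10000 >= 200000 -> not thin
```

This shows why the test is RTT-independent in the sense the commit
message describes: the same application write pattern yields the same
classification whether the RTT is 10 ms or 200 ms, because PIF scales
with RTT for a fixed ITT.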
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 21/03/16 19:54, Yuchung Cheng wrote:
> On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad
> <bro.de...@gmail.com> wrote:
>>
>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>> index 6a92b15..8f3f3bf 100644
>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>>>>         calculated, which is used to classify whether a stream is thin.
>>>>         Default: 10000
>>>>
>>>> +tcp_rdb - BOOLEAN
>>>> +       Enable RDB for all new TCP connections.
>>> Please describe RDB briefly, perhaps with a pointer to your paper.
>>
>> Ah, yes, that description may have been a bit too brief...
>>
>> What about pointing to tcp-thin.txt in the brief description, and
>> rewriting tcp-thin.txt with a more detailed description of RDB along
>> with a paper reference?
> +1
>>
>>> I suggest have three level of controls:
>>> 0: disable RDB completely
>>> 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket options
>>> 2: enable RDB on all thin-stream conn. by default
>>>
>>> currently it only provides mode 1 and 2. but there may be cases where
>>> the administrator wants to disallow it (e.g., broken middle-boxes).
>>
>> Good idea. Will change this.

I have implemented your suggestion in the next patch.

>>> It also seems better to allow individual socket to select the
>>> redundancy level (e.g., setsockopt TCP_RDB=3 means <=3 pkts per
>>> bundle) vs a global setting. This requires more bits in tcp_sock but
>>> 2-3 more is suffice.
>>
>> Most certainly. We decided not to implement this for the patch to keep
>> it as simple as possible, however, we surely prefer to have this
>> functionality included if possible.

Next patch version has a socket option to allow modifying the
different RDB settings.

>>>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
>>>> +                         unsigned int mss_now, gfp_t gfp_mask)
>>>> +{
>>>> +       struct sk_buff *rdb_skb = NULL;
>>>> +       struct sk_buff *first_to_bundle;
>>>> +       u32 bytes_in_rdb_skb = 0;
>>>> +
>>>> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
>>>> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
>>>
>>>> +
>>>> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
>>> During loss recovery tcp inflight fluctuates and would like to trigger
>>> this check even for non-thin-stream connections.
>>
>> Good point.
>>
>>> Since the loss already occurs, RDB can only take advantage from
>>> limited-transmit, which it likely does not have (b/c its a
>>> thin-stream). It might be checking if the state is open.
>>
>> You mean to test for open state to avoid calling rdb_can_bundle_test()
>> unnecessarily if we (presume to) know it cannot bundle anyway? That
>> makes sense, however, I would like to do some tests on whether "state
>> != open" is a good indicator on when bundling is not possible.

When testing this I found that bundling can often be performed when
not in Open state. For the most part in CWR mode, but also in the
other modes, so this does not seem like a good indicator.

The only problem with tcp_stream_is_thin_dpifl() triggering for
non-thin streams in loss recovery would be the performance penalty of
calling rdb_can_bundle_test(). It would not be able to bundle anyway,
since the previous SKB would contain >= mss worth of data.

The most reliable test is to check the available space in the previous
SKB, i.e. if (xmit_skb->prev->len == mss_now). Do you suggest, for
performance reasons, to do this before the call to
tcp_stream_is_thin_dpifl()?

>>> since RDB will cause DSACKs, and we only blindly count DSACKs to
>>> perform CWND undo. How does RDB handle that false positives?
>>
>> That is a very good question. The simple answer is that the
>> implementation does not handle any such false positives, which I
>> expect can result in incorrectly undoing CWND reduction in some
>> cases. This gets a bit complicated, so I'll have to do some more
>> testing on this to verify with certainty when it happens.
>>
>> When there is no loss, and each RDB packet arriving at the receiver
>> contains both
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index 6a92b15..8f3f3bf 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>>         calculated, which is used to classify whether a stream is thin.
>>         Default: 10000
>>
>> +tcp_rdb - BOOLEAN
>> +       Enable RDB for all new TCP connections.
> Please describe RDB briefly, perhaps with a pointer to your paper.

Ah, yes, that description may have been a bit too brief...

What about pointing to tcp-thin.txt in the brief description, and
rewriting tcp-thin.txt with a more detailed description of RDB along
with a paper reference?

> I suggest have three level of controls:
> 0: disable RDB completely
> 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket options
> 2: enable RDB on all thin-stream conn. by default
>
> currently it only provides mode 1 and 2. but there may be cases where
> the administrator wants to disallow it (e.g., broken middle-boxes).

Good idea. Will change this.

>> +       Default: 0
>> +
>> +tcp_rdb_max_bytes - INTEGER
>> +       Enable restriction on how many bytes an RDB packet can contain.
>> +       This is the total amount of payload including the new unsent data.
>> +       Default: 0
>> +
>> +tcp_rdb_max_packets - INTEGER
>> +       Enable restriction on how many previous packets in the output queue
>> +       RDB may include data from. A value of 1 will restrict bundling to
>> +       only the data from the last packet that was sent.
>> +       Default: 1
> why two metrics on redundancy?

We have primarily used the packet based limit in our tests. This is
also the most important knob, as it directly controls how many lost
packets each RDB packet may recover.

We believe that the byte based limit can also be useful because it
allows more fine-grained control over how much impact RDB can have on
the increased bandwidth requirements of the flows. If an application
writes 700 bytes per write call, the bandwidth increase can be quite
significant (even with a 1 packet bundling limit) if we consider a
scenario with thousands of RDB streams.

In some of our experiments with many simultaneous thin streams, where
we set up a bottleneck rate limited by an htb with a pfifo queue, we
observed considerable differences in loss rates depending on how many
bytes (packets) were allowed to be bundled with each packet. This is
partly why we recommend a default bundling limit of 1 packet.

By limiting the total payload size of RDB packets to e.g. 100 bytes,
only the smallest segments will benefit from RDB, while the segments
that would increase the bandwidth requirements the most will not.
While a very large number of RDB streams from one sender may be a
corner case, we still think this sysctl knob can be valuable for a
sysadmin who finds himself in such a situation.

> It also seems better to allow individual socket to select the
> redundancy level (e.g., setsockopt TCP_RDB=3 means <=3 pkts per
> bundle) vs a global setting. This requires more bits in tcp_sock but
> 2-3 more is suffice.

Most certainly. We decided not to implement this for the patch to keep
it as simple as possible, however, we surely prefer to have this
functionality included if possible.

>> +static unsigned int rdb_detect_loss(struct sock *sk)
>> +{
...
>> +       tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
>> +               if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
>> +                       break;
>> +               packets_lost++;
> since we only care if there is packet loss or not, we can return early here?

Yes, I considered that, and as long as the number of packets presumed
to be lost is not needed, that will suffice. However, could this not
be useful for statistical purposes? This is also relevant to the
comment from Eric on SNMP counters for how many times losses could be
repaired by RDB.

>> +               }
>> +               break;
>> +       }
>> +       return packets_lost;
>> +}
>> +
>> +/**
>> + * tcp_rdb_ack_event() - initiate RDB loss detection
>> + * @sk: socket
>> + * @flags: flags
>> + */
>> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
> flags are not used

Ah, yes, will remove that.

>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
>> +                        unsigned int mss_now, gfp_t gfp_mask)
>> +{
>> +       struct sk_buff *rdb_skb = NULL;
>> +       struct sk_buff *first_to_bundle;
>> +       u32 bytes_in_rdb_skb = 0;
>> +
>> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
>> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
>
>> +
>> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
> During loss recovery tcp inflight fluctuates and would like to trigger
> this check even for non-thin-stream
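The interplay of the two limits discussed above (tcp_rdb_max_packets
and tcp_rdb_max_bytes) can be sketched outside the kernel. This is an
illustrative model only, not the kernel's rdb_can_bundle_test(), which
operates on SKBs; the function name and parameters here are invented
for the example. It counts how many previous un-ACKed segments may be
bundled ahead of a new segment, honoring the one-MSS ceiling, the
packet limit, and the optional byte limit (0 = unlimited, as in the
quoted sysctl docs).

```python
def count_bundled(prev_seg_sizes, new_seg_size, mss,
                  max_packets=1, max_bytes=0):
    """prev_seg_sizes: payload sizes of previously sent, un-ACKed
    segments, most recent first. Returns how many of them can be
    bundled with the new segment."""
    total = new_seg_size
    bundled = 0
    for size in prev_seg_sizes:
        if max_packets and bundled >= max_packets:
            break                      # packet-count limit reached
        if total + size > mss:
            break                      # never exceed one MSS of payload
        if max_bytes and total + size > max_bytes:
            break                      # total-payload (byte) limit
        total += size
        bundled += 1
    return bundled

# 100-byte write with two prior 100-byte segments, default limit of 1:
print(count_bundled([100, 100], 100, mss=1460))                 # -> 1
# Raising the packet limit allows both previous segments to be bundled:
print(count_bundled([100, 100], 100, mss=1460, max_packets=2))  # -> 2
# A full-sized previous segment can never be bundled:
print(count_bundled([1400], 100, mss=1460))                     # -> 0
```

The last case mirrors the point made above: once the previous SKB
holds >= mss worth of data, no bundling is possible regardless of the
configured limits.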
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 14/03/16 22:15, Eric Dumazet wrote:
> Acked-by: Eric Dumazet
>
> Note that RDB probably should get some SNMP counters,
> so that we get an idea of how many times a loss could be repaired.

Good idea. Simply count how many times an RDB packet successfully
repaired a loss? Note that this can be one or more lost packets. When
bundling N packets, the RDB packet can repair up to N losses in the
previous N packets that were sent.

Which list should this be added to? snmp4_tcp_list?

Any other counters that would be useful? Total number of RDB packets
transmitted?

> Ideally, if the path happens to be lossless, all these pro active
> bundles are overhead. Might be useful to make RDB conditional to
> tp->total_retrans or something.

Yes, that is a good point. We have discussed this (for years really),
but have not had the opportunity to investigate it in depth.

Having such a condition hard coded is not ideal, as it very much
depends on the use case whether bundling from the beginning is
desirable. In most cases, this is probably a fair compromise, but
preferably we would have some logic/settings to control how the
bundling rate can be dynamically adjusted in response to certain
events, defined by a set of given metrics.

A conservative (default) setting would not do bundling until loss has
been registered, and could also check against some smoothed loss
indicator, such that a certain amount of loss must have occurred
within a specific time frame to allow bundling. This could be useful
in cases where the network congestion varies greatly depending on
factors such as the time of day.

In a scenario where minimal application layer latency is very
important, but only sporadic (single) packet loss is expected to
regularly occur, always bundling one previous packet may be both
sufficient and desirable.

In the end, the best settings for an application/service depend on the
degree to which application layer latency (both minimal and
variations) affects the QoE. There are many possibilities to consider
in this regard, and I expect we will not have this question fully
explored any time soon. Most importantly, we should ensure that such
logic can easily be added later on without breaking backwards
compatibility.

Suggestions and comments on this are very welcome.

Bendik
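The conservative "wait for loss before bundling" policy discussed in
this exchange (Eric's tp->total_retrans suggestion, which later became
the tcp_rdb_wait_congestion sysctl in v7) can be sketched as a simple
predicate. All names and parameters here are illustrative, not the
kernel's; this only models the decision logic.

```python
def rdb_should_bundle(total_retrans, wait_congestion=True,
                      bundle_immediate=False):
    """Decide whether RDB bundling should be attempted.

    total_retrans: retransmissions seen so far on the connection.
    wait_congestion: conservative default; no loss yet means bundling
                     would be pure overhead on a lossless path.
    bundle_immediate: per-socket override in the spirit of the
                      TCP_RDB_BUNDLE_IMMEDIATE flag from v7.
    """
    if bundle_immediate:
        return True
    if wait_congestion:
        return total_retrans > 0
    return True

print(rdb_should_bundle(total_retrans=0))  # lossless so far -> False
print(rdb_should_bundle(total_retrans=3))  # loss observed   -> True
```

A smoothed loss indicator, as proposed above, would replace the plain
total_retrans > 0 test with a rate measured over a time window.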
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 14/03/16 22:59, Yuchung Cheng wrote:
> OK that makes sense.
>
> I left some detailed comments on the actual patches. I would encourage
> to submit an IETF draft to gather feedback from tcpm b/c the feature
> seems portable.

Thank you for the suggestion, we appreciate the confidence. We have
had in mind to eventually pursue a standardization process, but have
been unsure about how a mechanism that actively introduces redundancy
would be received by the IETF. It may now be the right time to propose
the RDB mechanism, and we will aim to present an IETF draft in the
near future.

Bendik
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 03/10/2016 01:20 AM, Yuchung Cheng wrote:
> I read the paper. I think the underlying idea is neat. but the
> implementation is little heavy-weight that requires changes on fast
> path (tcp_write_xmit) and space in skb control blocks.

Yuchung, thank you for taking the time to review the patch submission
and read the paper.

I must admit I was not particularly happy about the extra if-test on
the fast path, and I fully understand the wish to keep the fast path
as simple and clean as possible. However, is the performance hit that
significant considering the branch prediction hint for the non-RDB
path?

The extra variable needed in the SKB CB does not require increasing
the CB buffer size due to the "tcp: refactor struct tcp_skb_cb" patch
(http://patchwork.ozlabs.org/patch/510674), and uses only some of the
space made available in the outgoing SKBs' CB. Therefore I hoped the
extra variable would be acceptable.

> ultimately this patch is meant for a small set of specific
> applications.

Yes, the RDB mechanism is aimed at a limited set of applications,
specifically time-dependent applications that produce non-greedy,
application limited (thin) flows. However, our hope is that RDB may
greatly improve TCP's position as a viable alternative for
applications transmitting latency sensitive data.

> In my mental model (please correct me if I am wrong), losses on these
> thin streams would mostly resort to RTOs instead of fast recovery, due
> to the bursty nature of Internet losses.

This depends on the transmission pattern of the applications, which
varies a great deal, also between the different types of
time-dependent applications that produce thin streams.

For short flows, (bursty) loss at the end will result in an RTO (if
TLP does not probe), but thin streams are often long lived, and the
applications producing them continue to write small data segments to
the socket at intervals of tens to hundreds of milliseconds.

What determines whether an RTO, and not a fast retransmit, will resend
the packet is the number of PIFs, which directly correlates to how
often the application writes data to the socket in relation to the
RTT. As long as the number of packets successfully completing a round
trip before the RTO is >= the dupACK threshold, they will not depend
on RTOs (not considering TLP). Early retransmit and the
TCP_THIN_DUPACK socket option will also affect the likelihood of RTOs
vs fast retransmits.

> The HOLB comes from RTO only retransmit the first (tiny) unacked
> packet while a small of new data is readily available. But since Linux
> congestion control is packet-based, and loss cwnd is 1, the new data
> needs to wait until the 1st packet is acked which is for another RTT.

If I understand you correctly, you are referring to HOLB on the sender
side, which is the extra delay on new data that is held back when the
connection is CWND-limited. In the paper, we refer to this extra delay
as increased sojourn times for the outgoing data segments.

We do not include this additional sojourn time for the segments on the
sender side in the ACK latency plots (Fig. 4 in the paper). This is
simply because the pcap traces contain the timestamps when the packets
are sent, and not when the segments are added to the output queue.

When we refer to the HOLB effect in the paper as well as the thesis,
we refer to the extra delays (sojourn times) on the receiver side,
where segments are held back (not made available to user space) due
to gaps in the sequence range when packets are lost (we had no
reordering).

So, when considering the increased delays due to HOLB on the receiver
side, HOLB is not at all limited to RTOs. Actually, it is mostly not
due to RTOs in the tests we have run; however, this also depends very
much on the transmission pattern of the application as well as on loss
levels. In general, HOLB on the receiver side will affect any flow
that transmits a packet with new data after a packet is lost (the
sender may not know yet), where the lost packet has not already been
retransmitted.

Consider a sender application that performs write calls every 30 ms on
a 150 ms RTT link. It will need a CWND that allows 5-6 PIFs to be able
to transmit all new data segments with no extra sojourn times on the
sender side. When one packet is lost, the next 5 packets that are sent
will be held back on the receiver side due to the missing segment
(HOLB). In the best case scenario, the first dupACK triggers a fast
retransmit around the same time as the fifth packet (after the lost
packet) is sent. In that case, the first segment sent after the lost
segment is held back on the receiver for 150 ms (the time it takes for
the dupACK to reach the sender, and the fast retransmission to arrive
at the receiver). The second is held back 120 ms, the third 90 ms, the
fourth 60 ms, and the fifth 30 ms. All of this extra delay is added
before the sender even knows there was a loss. How it decides to react
to the loss signal (dupACKs) will further decide how much
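The hold-back arithmetic in the 30 ms / 150 ms example above can be
written out directly. This is just the best-case scenario described in
the text (fast retransmit arrives exactly one RTT after the loss, with
the application writing at a fixed interval); segment i after the loss
waits until the retransmission lands.

```python
def holb_delays_ms(write_interval_ms, rtt_ms):
    """Receiver-side hold-back delay, in ms, for each segment sent
    after a single lost packet, assuming the fast retransmission
    arrives one RTT after the loss. Segment i (1-based) has already
    waited (i - 1) write intervals less than the first one."""
    n = rtt_ms // write_interval_ms  # segments sent before the rtx lands
    return [rtt_ms - (i - 1) * write_interval_ms for i in range(1, n + 1)]

print(holb_delays_ms(30, 150))  # -> [150, 120, 90, 60, 30]
```

This reproduces the figures from the prose: five segments held back
for 150, 120, 90, 60, and 30 ms respectively, all incurred before the
sender has even learned of the loss.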
[PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation on the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner. Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674) These patches have also been tested with as set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2]. 
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v6 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
   * Minor doc changes
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Minor doc changes

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss() and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  34 +
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  48 ---
 net/ipv4/tcp_rdb.c                     | 228 +
 12 files changed, 375 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

--
1.9.1
[PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games, remote control systems, and VoIP, produce traffic with thin-stream characteristics, characterized by small packets and relatively high inter-transmission times (ITT). When experiencing packet loss, such latency-sensitive applications are heavily penalized by the need to retransmit lost packets, which increases the latency by a minimum of one RTT for the lost packet. Packets coming after a lost packet are held back due to head-of-line blocking, causing increased delays for all data segments until the lost packet has been retransmitted. RDB enables a TCP sender to bundle redundant (already sent) data with TCP packets containing small segments of new data. By resending un-ACKed data from the output queue in packets with new data, RDB reduces the need to retransmit data segments on connections experiencing sporadic packet loss. By avoiding a retransmit, RDB evades the latency increase of at least one RTT for the lost packet, as well as alleviating head-of-line blocking for the packets following the lost packet. This makes the TCP connection more resistant to latency fluctuations, and reduces the application layer latency significantly in lossy environments. Main functionality added: o When a packet is scheduled for transmission, RDB builds and transmits a new SKB containing both the unsent data as well as data of previously sent packets from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound. o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. 
This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. RDB can be enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable net.ipv4.tcp_rdb=1 Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 15 +++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 2 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 25 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 48 --- net/ipv4/tcp_rdb.c | 228 + 12 files changed, 335 insertions(+), 23 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 6a92b15..8f3f3bf 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. 
+ Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 797cefb..0f2c9d1 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned in
[PATCH v6 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams, tcp_stream_is_thin(), is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index d5df40c..6a92b15 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 1 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 692db63..d38eae9 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (1 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 1 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Test if the stream is thin based on + * dynamic PIF limit (DPIFL) + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT). 
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 1e1fe60..f04320a 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -572,6 +573,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = _tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_d
Re: [PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 03/02/2016 08:52 PM, David Miller wrote: > From: "Bendik Rønning Opstad" <bro.de...@gmail.com> > Date: Wed, 24 Feb 2016 22:12:55 +0100 > >> +/* tcp_rdb.c */ >> +void rdb_ack_event(struct sock *sk, u32 flags); > > Please name globally visible symbols with a prefix of "tcp_*", thanks. Yes, of course. I will fix that for the next version. Thanks. Bendik
[PATCH v5 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost.

RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner.

Note: The current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v5 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed two unnecessary EXPORT_SYMBOLs (Thanks Eric)
   * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
   * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
   * Merged the two if tests for max payload of RDB packet in rdb_can_bundle_test()
   * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss() and restructured to reduce indentation.
   * Improved docs
   * Revised commit message to be more detailed.
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c)

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  34 +
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  47 ---
 net/ipv4/tcp_rdb.c                     | 225 +
 12 files changed, 371 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

--
1.9.1
[PATCH v5 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games, remote control systems, and VoIP, produce traffic with thin-stream characteristics, characterized by small packets and relatively high inter-transmission times (ITT). When experiencing packet loss, such latency-sensitive applications are heavily penalized by the need to retransmit lost packets, which increases the latency by a minimum of one RTT for the lost packet. Packets coming after a lost packet are held back due to head-of-line blocking, causing increased delays for all data segments until the lost packet has been retransmitted. RDB enables a TCP sender to bundle redundant (already sent) data with TCP packets containing small segments of new data. By resending un-ACKed data from the output queue in packets with new data, RDB reduces the need to retransmit data segments on connections experiencing sporadic packet loss. By avoiding a retransmit, RDB evades the latency increase of at least one RTT for the lost packet, as well as alleviating head-of-line blocking for the packets following the lost packet. This makes the TCP connection more resistant to latency fluctuations, and reduces the application layer latency significantly in lossy environments. Main functionality added: o When a packet is scheduled for transmission, RDB builds and transmits a new SKB containing both the unsent data as well as data of previously sent packets from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound. o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. 
This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. RDB can be enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable net.ipv4.tcp_rdb=1 Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 15 +++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 2 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 25 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 47 --- net/ipv4/tcp_rdb.c | 225 + 12 files changed, 331 insertions(+), 23 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 3b23ed8f..e3869e7 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. 
+ Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index eab4f8f..af15a3c 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2931,6 +2931,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned in
[PATCH v5 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams, tcp_stream_is_thin(), is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 24ce97f..3b23ed8f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 1 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 692db63..a29300a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (1 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 1 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT). 
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 1e1fe60..f04320a 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -572,6 +573,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = _tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_d
Re: [PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 02/18/2016 04:18 PM, Eric Dumazet wrote: >> >> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) >> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) >> { >> __copy_skb_header(new, old); >> >> @@ -1061,6 +1061,7 @@ static void copy_skb_header(struct sk_buff *new, const >> struct sk_buff *old) >> skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs; >> skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type; >> } >> +EXPORT_SYMBOL(copy_skb_header); > > Why are you exporting this ? tcp is statically linked into vmlinux. Ah, this is actually leftover from the earlier module based implementation of RDB. Will remove. >> +EXPORT_SYMBOL(skb_append_data); > > Same remark here. Will remove. > And this is really a tcp helper, you should add a tcp_ prefix. Certainly. > About rdb_build_skb() : I do not see where you make sure > @bytes_in_rdb_skb is not too big ? The number of previous SKBs in the queue to copy data from is given by rdb_can_bundle_test(), which tests if total payload does not exceed the MSS. Only if there is room (within the MSS) will it test the sysctl options to further restrict bundling: + /* We start at xmit_skb->prev, and go backwards. */ + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { + if ((total_payload + skb->len) > mss_now) + break; + + if (sysctl_tcp_rdb_max_bytes && + ((total_payload + skb->len) > sysctl_tcp_rdb_max_bytes)) + break; I'll combine these two to (total_payload + skb->len) > max_payload > tcp_rdb_max_bytes & tcp_rdb_max_packets seem to have no .extra2 upper > limit, so a user could do something really stupid and attempt to crash > the kernel. Those sysctl additions are actually a bit buggy, specifically the proc_handlers. Is it not sufficient to ensure that 0 is the lowest possible value? The max payload limit is really min(mss_now, sysctl_tcp_rdb_max_bytes), so if sysctl_tcp_rdb_max_bytes or sysctl_tcp_rdb_max_packets are set too large, bundling will simply be limited by the MSS. 
> Presumably I would use SKB_MAX_HEAD(MAX_TCP_HEADER) so that we do not > try high order page allocation. Do you suggest something like this?: bytes_in_rdb_skb = min_t(u32, bytes_in_rdb_skb, SKB_MAX_HEAD(MAX_TCP_HEADER)); Is this necessary when bytes_in_rdb_skb will always contain exactly the required number of bytes for the payload of the (RDB) packet, which will never be greater than mss_now? Or is it aimed at scenarios where the page size is so small that allocating to an MSS (of e.g. 1460) will require high order page allocation? Thanks for looking over the code! Bendik
[PATCH v4 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost.

RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt).

The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner.

Note: The current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674)

These patches have also been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b)

Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].

[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf
[2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf

Changes:

v4 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Moved skb_append_data() to tcp_output.c and call this function from tcp_collapse_retrans() as well.
   * Merged functionality of create_rdb_skb() into tcp_transmit_rdb_skb()
   * Removed one parameter from rdb_can_bundle_test()

v3 (PATCH):
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb.
   * Formatting and function docs fixes

v2 (RFC/PATCH):
 * tcp-Add-DPIFL-thin-stream-detection-mechanism:
   * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
 * tcp-Add-Redundant-Data-Bundling-RDB:
   * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell.
   * Cleaned up loss detection code in rdb_check_rtx_queue_loss

v1 (RFC/PATCH)

Bendik Rønning Opstad (2):
  tcp: Add DPIFL thin stream detection mechanism
  tcp: Add Redundant Data Bundling (RDB)

 Documentation/networking/ip-sysctl.txt |  23
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  36 ++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   3 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  35 ++
 net/ipv4/tcp.c                         |  16 ++-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  49 +---
 net/ipv4/tcp_rdb.c                     | 215 +
 12 files changed, 365 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c

--
1.9.1
[PATCH v4 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. 
RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl net.ipv4.tcp_rdb=1 Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 15 +++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 ++- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 49 +--- net/ipv4/tcp_rdb.c | 215 + 12 files changed, 325 insertions(+), 23 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 3b23ed8f..e3869e7 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 3920675..9075041 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2923,6 +2923,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index bcbf51d..c84de15 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -207,9 +207,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair :
[PATCH v4 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 24ce97f..3b23ed8f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index bdd5e1c..a23ce24 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -264,6 +266,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1645,6 +1648,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT).
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index b537338..e7b1b30 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -595,6 +596,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = &sysctl_tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = _d
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Sorry guys, I messed up that email by including HTML, and it got rejected by netdev@vger.kernel.org. I'll resend it properly formatted. Bendik
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
Eric, thank you for the feedback! On Wed, Feb 3, 2016 at 8:34 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Wed, 2016-02-03 at 19:17 +0100, Bendik Rønning Opstad wrote: >> On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: >>> Really this looks very complicated. >> >> Can you be more specific? > > A lot of code added, needing maintenance cost for years to come. Yes, that is understandable. >>> Why not simply append the new skb content to prior one ? >> >> It's not clear to me what you mean. At what stage in the output engine >> do you refer to? >> >> We want to avoid modifying the data of the SKBs in the output queue, > > Why ? We already do that, as I pointed out. I suspect that we might be talking past each other. It wasn't clear to me that we were discussing how to implement this in a different way. The current retrans collapse functionality only merges SKBs that contain data that has already been sent and is about to be retransmitted. This differs significantly from RDB, which combines both already transmitted data and unsent data in the same packet without changing how the data is stored (and the state tracked) in the output queue. Another difference is that RDB includes un-ACKed data that is not considered lost. >> therefore we allocate a new SKB (This SKB is named rdb_skb in the code). >> The header and payload of the first SKB containing data we want to >> redundantly transmit is then copied. Then the payload of the SKBs following >> next in the output queue is appended onto the rdb_skb. The last payload >> that is appended is from the first SKB with unsent data, i.e. the >> sk_send_head. >> >> Would you suggest a different approach? >> >>> skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is >>> really available (ie its clone not sitting/waiting in a qdisc on the >>> host) >> >> Where do you suggest this should be used? > > To detect if appending data to prior skb is possible. I see. 
As the implementation intentionally avoids modifying SKBs in the output queue, this was not obvious. > If the prior packet is still in qdisc, no change is allowed, > and it is fine : DRB should not trigger anyway. Actually, whether the data in the prior SKB is on the wire or is still on the host (in qdisc/driver queue) is not relevant. RDB always wants to redundantly resend the data if there is room in the packet, because the previous packet may become lost. >>> Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - >>> MAX_TCP_HEADER) available bytes in skb->data. >> >> Sure, rdb_build_skb() could use this instead of the calculated >> bytes_in_rdb_skb. > > Point is : small packets already have tail room in skb->head Yes, I'm aware of that. But we do not allocate new SKBs because we think the existing SKBs do not have enough space available. We do it to avoid modifications to the SKBs in the output queue. > When RDB decides a packet should be merged into the prior one, you can > simply copy payload into the tailroom, then free the skb. > > No skb allocations are needed, only freeing. It wasn't clear to me that you were suggesting a completely different implementation approach altogether. As I understand you, the approach you suggest is as follows: 1. An SKB containing unsent data is processed for transmission (let's call it T_SKB) 2. Check if the previous SKB (let's call it P_SKB) (containing sent but un-ACKed data) has available (tail) room for the payload contained in T_SKB. 3. If room in P_SKB: * Copy the unsent data from T_SKB to P_SKB by appending it to the linear data and update sequence numbers. * Remove T_SKB (which contains only the new and unsent data) from the output queue. * Transmit P_SKB, which now contains some already sent data and some unsent data. If I have misunderstood, can you please elaborate in detail what you mean?
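To check my understanding, the suggested approach could be modeled in userspace roughly like this (a toy sketch with invented names; BUF_CAP stands in for the linear tailroom of an skb, and the struct fields are simplifications of skb state):

```c
#include <string.h>

#define BUF_CAP 2048  /* stand-in for the linear tailroom of an skb */

/* Toy segment with a fixed linear buffer, modeling skb->data + tailroom. */
struct tseg {
    char data[BUF_CAP];
    int len;        /* bytes currently in the buffer */
    int end_seq;    /* sequence number after the last byte */
};

/*
 * Model of the suggested alternative: if the prior (sent, un-ACKed)
 * segment has tailroom for the new payload, append the payload there
 * and tell the caller the new segment can be unlinked and freed.
 * Returns 1 on merge, 0 if there was no room.
 */
static int merge_into_prior(struct tseg *prior, const struct tseg *next)
{
    if (prior->len + next->len > BUF_CAP)
        return 0;                      /* no tailroom: transmit separately */
    memcpy(prior->data + prior->len, next->data, next->len);
    prior->len += next->len;
    prior->end_seq = next->end_seq;    /* sequence space is now contiguous */
    return 1;
}
```

On a successful merge, P_SKB's buffer covers both the already-sent and the new sequence range in one linear run, and T_SKB can simply be removed from the queue.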
If this is the approach you suggest, I can think of some potential downsides that require further considerations: 1) ACK-accounting will work differently When the previous SKB (P_SKB) is modified by appending the data of the next SKB (T_SKB), what should happen when an incoming ACK acknowledges the data that was sent in the original transmission (before the SKB was modified), but not the data that was appended later? tcp_clean_rtx_queue currently handles partially ACKed SKBs due to TSO, in which case the tcp_skb_pcount(skb) > 1. So this function would need to be modified to handle this for RDB modified SKBs in the queue, where all the data is located in the linear data buffer (no GSO segs). How should SACK and retrans flags be handled when one SKB in the output queue can represent multiple transmitted packets? 2) Ti
Re: [PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Tue, Feb 2, 2016 at 9:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Tue, 2016-02-02 at 20:23 +0100, Bendik Rønning Opstad wrote: >> >> o When packets are scheduled for transmission, RDB replaces the SKB to >> be sent with a modified SKB containing the redundant data of >> previously sent data segments from the TCP output queue. > > Really this looks very complicated. Can you be more specific? > Why not simply append the new skb content to prior one ? It's not clear to me what you mean. At what stage in the output engine do you refer to? We want to avoid modifying the data of the SKBs in the output queue, therefore we allocate a new SKB (This SKB is named rdb_skb in the code). The header and payload of the first SKB containing data we want to redundantly transmit is then copied. Then the payload of the SKBs following next in the output queue is appended onto the rdb_skb. The last payload that is appended is from the first SKB with unsent data, i.e. the sk_send_head. Would you suggest a different approach? > skb_still_in_host_queue(sk, prior_skb) would also tell you if the skb is > really available (ie its clone not sitting/waiting in a qdisc on the > host) Where do you suggest this should be used? > Note : select_size() always allocate skb with SKB_WITH_OVERHEAD(2048 - > MAX_TCP_HEADER) available bytes in skb->data. Sure, rdb_build_skb() could use this instead of the calculated bytes_in_rdb_skb. > Also note that tcp_collapse_retrans() is very similar to your needs. You > might simply expand it. The functionality shared is the copying of data from one SKB to another, as well as adjusting sequence numbers and checksum. Unlinking SKBs from the output queue, modifying the data of SKBs in the output queue, and changing retrans hints is not shared. To reduce code duplication, the function skb_append_data in tcp_rdb.c could be moved to tcp_output.c, and then be called from tcp_collapse_retrans. Is it something like this you had in mind? Bendik
[PATCH v3 net-next 0/2] tcp: Redundant Data Bundling (RDB)
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, RDB can be kept from affecting competing traffic unfairly. Note: The current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" (http://patchwork.ozlabs.org/patch/510674) These patches have also been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf Changes: v3 (PATCH): * tcp-Add-Redundant-Data-Bundling-RDB: * Changed name of sysctl variable from tcp_rdb_max_skbs to tcp_rdb_max_packets after comment from Eric Dumazet about not exposing internal (kernel) names like skb. * Formatting and function docs fixes v2 (RFC/PATCH): * tcp-Add-DPIFL-thin-stream-detection-mechanism: * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet. * tcp-Add-Redundant-Data-Bundling-RDB: * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell. * Cleaned up loss detection code in rdb_check_rtx_queue_loss v1 (RFC/PATCH) Bendik Rønning Opstad (2): tcp: Add DPIFL thin stream detection mechanism tcp: Add Redundant Data Bundling (RDB) Documentation/networking/ip-sysctl.txt | 23 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 35 + include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 35 + net/ipv4/tcp.c | 16 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 273 + 12 files changed, 399 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c -- 1.9.1
[PATCH v3 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss, which significantly reduces the application-layer latency in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to the lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1.
Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 273 + 12 files changed, 359 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index eb42853..14f960d 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 11f935c..eb81877 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2914,6 +2914,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index b386361..da6dae8 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1,
[PATCH v3 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 73b36d7..eb42853 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 3dd20fe..2d86bd7 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* TCP initial congestion window as per rfc6928 */ #define TCP_INIT_CWND 10 @@ -271,6 +273,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1649,6 +1652,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT).
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 4d367b4..6014bc4 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -41,6 +41,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -687,6 +688,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = &sysctl_tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = _d
Re: [PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 23/11/15 18:43, Eric Dumazet wrote: > On Mon, 2015-11-23 at 17:26 +0100, Bendik Rønning Opstad wrote: > >> > + >> > +tcp_rdb_max_skbs - INTEGER >> > + Enable restriction on how many previous SKBs in the output queue >> > + RDB may include data from. A value of 1 will restrict bundling to >> > + only the data from the last packet that was sent. >> > + Default: 1 >> > + > skb is an internal thing. I would rather not expose a sysctl with such > name. > > Can be multi segment or not (if GSO/TSO is enabled) > > So even '1' skb can have very different content, from 1 byte to ~64 KB I see your point about not exposing the internal naming. What about tcp_rdb_max_packets? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC v2 net-next 0/2] tcp: Redundant Data Bundling (RDB)
This is a request for comments. Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, RDB can be kept from affecting competing traffic unfairly. Note: The current patch set depends on a recently submitted patch for tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674) These patches have been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2]. [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf Changes: v2: * tcp-Add-DPIFL-thin-stream-detection-mechanism: * Change calculation in tcp_stream_is_thin_dpifl based on feedback from Eric Dumazet.
* tcp-Add-Redundant-Data-Bundling-RDB: * Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) to reduce complexity as commented by Neal Cardwell. * Cleaned up loss detection code in rdb_check_rtx_queue_loss Bendik Rønning Opstad (2): tcp: Add DPIFL thin stream detection mechanism tcp: Add Redundant Data Bundling (RDB) Documentation/networking/ip-sysctl.txt | 23 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 35 + include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 35 + net/ipv4/tcp.c | 16 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 271 + 12 files changed, 397 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC v2 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1. 
Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 net/ipv4/tcp.c | 14 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 271 + 12 files changed, 357 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 938ae73..1077de1 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_skbs - INTEGER + Enable restriction on how many previous SKBs in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c9c394b..1639cc0 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2806,6 +2806,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index b386361..da6dae8 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1, frt
[PATCH RFC v2 net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/net/tcp.h | 21 + net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 2 ++ 4 files changed, 40 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 2ea4c45..938ae73 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound inter-transmission time (ITT) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). Based on + this threshold, a dynamic packets in flight limit (DPIFL) is + calculated, which is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/net/tcp.h b/include/net/tcp.h index 4fc457b..deac96f 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +/* Lowest possible DPIFL lower bound ITT is 10 ms (10000 usec) */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */ #define TCP_INIT_CWND 10 @@ -274,6 +276,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1631,6 +1634,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) +{ + /* Calculate the maximum allowed PIF limit by dividing the RTT by +* the minimum allowed inter-transmission time (ITT).
+* Tests if PIF < RTT / ITT-lower-bound +*/ + return (u64) tcp_packets_in_flight(tp) * + sysctl_tcp_thin_dpifl_itt_lower_bound < (tp->srtt_us >> 3); +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index a0bd7a5..5b12446 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = { 0, 0 }; static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; +static int tcp_thin_dpifl_itt_lower_bound_min = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; /* Update system visible IP port range */ static void set_local_port_range(struct net *net, int range[2]) @@ -709,6 +710,14 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_thin_dpifl_itt_lower_bound", + .data = _tcp_thin_dpifl_itt_lower_bound, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = _d
Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 24/10/15 14:57, Eric Dumazet wrote: > Thank you for this very high quality patch submission. > > Please give us a few days for proper evaluation. > > Thanks ! Guys, thank you very much for taking the time to evaluate this. Since there hasn't been any further feedback or comments, I'll submit an RFCv2 with a few changes, which include removing the Nagle modification. After discussing the Nagle change on setsockopt we realized that it should be evaluated more thoroughly, and is better left for a later patch submission. Bendik P.S. Trimming the CC list to only those who have responded as gmail says I'm spamming :-)
Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Monday, November 02, 2015 09:37:54 AM David Laight wrote: > From: Bendik Rønning Opstad > > Sent: 23 October 2015 21:50 > > RDB is a mechanism that enables a TCP sender to bundle redundant > > (already sent) data with TCP packets containing new data. By bundling > > (retransmitting) already sent data with each TCP packet containing new > > data, the connection will be more resistant to sporadic packet loss > > which reduces the application layer latency significantly in congested > > scenarios. > > What sort of traffic flows do you expect this to help? As mentioned in the cover letter, RDB is aimed at reducing the latencies for "thin-stream" traffic often produced by latency-sensitive applications. This blog post describes RDB and the underlying motivation: http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp Further information is available in the links referred to in the blog post. > An ssh (or similar) connection will get additional data to send, > but that sort of data flow needs Nagle in order to reduce the > number of packets sent. Whether an application needs to reduce the number of packets sent depends on the perspective of who you ask. If low latency is of high priority for the application it may need to increase the number of packets sent by disabling Nagle to reduce the segments sojourn times on the sender side. As for SSH clients, it seems OpenSSH disables Nagle for interactive sessions. > OTOH it might benefit from including unacked data if the Nagle > timer expires. > Being able to set the Nagle timer on a per-connection basis > (or maybe using something based on the RTT instead of 2 secs) > might make packet loss less problematic. There is no timer for Nagle? The current (Minshall variant) implementation restricts sending a small segment as long as the previously transmitted packet was small and is not yet ACKed. 
> Data flows that already have Nagle disabled (probably anything that > isn't command-response and isn't unidirectional bulk data) are > likely to generate a lot of packets within the RTT. How many packets such applications need to transmit for optimal latency varies to a great extent. Packets per RTT is not a very useful metric in this regard, considering the strict dependency on the RTT. This is why we propose a dynamic packets in flight limit (DPIFL) that indirectly relies on the application write frequency, i.e. how often the application performs write system calls. This limit is used to ensure that only applications that write data less frequently than a certain limit may utilize RDB. > Resending unacked data will just eat into available network bandwidth > and could easily make any congestion worse. > > I think that means you shouldn't resend data more than once, and/or > should make sure that the resent data isn't a significant overhead > on the packet being sent. It is important to remember what type of traffic flows we are discussing. The applications that RDB is aimed at helping produce application-limited flows that transmit small amounts of data, both in terms of payload per packet and packets per second. Analysis of traces from latency-sensitive applications producing traffic with thin-stream characteristics shows inter-transmission times ranging from a few ms (typically 20-30 ms on average) to many hundred ms. (http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp/#thin_streams) Increasing the amount of transmitted data will certainly contribute to congestion to some degree, but it is not (necessarily) an unreasonable trade-off considering the relatively small amounts of data such applications transmit compared to greedy flows. RDB does not cause more packets to be sent through the network, as it uses available "free" space in packets already scheduled for transmission.
With a bundling limitation of only one previous segment, the bandwidth requirement is doubled; accounting for headers, the increase is less. Even after doubling the BW requirement of an application that produces relatively little data, we still end up with a low BW requirement. The suggested minimum lower bound inter-transmission time is 10 ms, meaning that when an application writes data more frequently than every 10 ms (on average) it will not be allowed to utilize RDB. To what degree RDB affects competing traffic will of course depend on the link capacity and the number of simultaneous flows utilizing RDB. We have performed tests to assess how RDB affects competing traffic. In one of the test scenarios, 10 RDB-enabled thin streams and 10 regular TCP thin streams compete against 5 greedy TCP flows over a shared bottleneck limited to 5 Mbit/s. The results from this test show that by only bundling one previous segment with each packet (segment size: 120 bytes), the effect on the competing thin-stream traffic is modest. (http://mlab.no/blog/2015/10/redundant-d
Re: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Monday, October 26, 2015 02:58:03 PM Yuchung Cheng wrote: > On Mon, Oct 26, 2015 at 2:35 PM, Andreas Petlund <apetl...@simula.no> wrote: > > > On 26 Oct 2015, at 15:50, Neal Cardwell <ncardw...@google.com> wrote: > > > > > > On Fri, Oct 23, 2015 at 4:50 PM, Bendik Rønning Opstad > > > > > > <bro.de...@gmail.com> wrote: > > >> @@ -2409,6 +2412,15 @@ static int do_tcp_setsockopt(struct sock *sk, > > >> int level,> > > > > ... > > > > > >> + case TCP_RDB: > > >> + if (val < 0 || val > 1) { > > >> + err = -EINVAL; > > >> + } else { > > >> + tp->rdb = val; > > >> + tp->nonagle = val; > > > > > > The semantics of the tp->nonagle bits are already a bit complex. My > > > sense is that having a setsockopt of TCP_RDB transparently modify the > > > nagle behavior is going to add more extra complexity and unanticipated > > > behavior than is warranted given the slight possible gain in > > > convenience to the app writer. What about a model where the > > > application user just needs to remember to call > > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be > > > sensible? I see your nice tests at > > > > > > https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b > > > 7d8baf703b2c2ac1b> > > > > are already doing that. And my sense is that likewise most > > > well-engineered "thin stream" apps will already be using > > > setsockopt(TCP_NODELAY). Is that workable? This is definitely workable. I agree that it may not be an ideal solution to have TCP_RDB disable Nagle, however, it would be useful with a way to easily enable RDB and disable Nagle. > > We have been discussing this a bit back and forth. Your suggestion would > > be the right thing to keep the nagle semantics less complex and to > > educate developers in the intrinsics of the transport. > > > > We ended up choosing to implicitly disable nagle since it > > 1) is incompatible with the logic of RDB. 
> > 2) leaving it up to the developer to read the documentation and register > > the line saying that "failing to set TCP_NODELAY will void the RDB > > latency gain" will increase the chance of misconfigurations leading to > > deployment with no effect. > > > > The hope was to help both the well-engineered thin-stream apps and the > > ones deployed by developers with less detailed knowledge of the > > transport. > but would RDB be voided if this developer turns on RDB then turns on > Nagle later? It would (to a large degree), but I believe that's ok? The intention with also disabling Nagle is not to remove control from the application writer, so if TCP_RDB disables Nagle, they should not be prevented from explicitly enabling Nagle after enabling RDB. The idea is to make it as easy as possible for the application writer, and since Nagle is on by default, it makes sense to change this behavior when the application has indicated that it values low latencies. Would a solution with multiple option values to TCP_RDB be acceptable? E.g. 0 = Disable 1 = Enable RDB 2 = Enable RDB and disable Nagle If the sysctl tcp_rdb accepts the same values, setting the sysctl to 2 would allow to use and test RDB (with Nagle off) on applications that haven't explicitly disabled Nagle, which would make the sysctl tcp_rdb even more useful. Instead of having TCP_RDB modify Nagle, would it be better/acceptable to have a separate socket option (e.g. TCP_THIN/TCP_THIN_LOW_LATENCY) that enables RDB and disables Nagle? e.g. 0 = Use default system options? 1 = Enable RDB and disable Nagle This would separate the modification of Nagle from the TCP_RDB socket option and make it cleaner? 
Such an option could also enable other latency-reducing options like TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK: 2 = Enable RDB, TCP_THIN_LINEAR_TIMEOUTS, TCP_THIN_DUPACK, and disable Nagle Bendik
Re: [PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
On Friday, October 23, 2015 02:44:14 PM Eric Dumazet wrote: > On Fri, 2015-10-23 at 22:50 +0200, Bendik Rønning Opstad wrote: > > > > > +/** > > + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on > > dynamic PIF > > + * limit > > + * @tp: the tcp_sock struct > > + * > > + * Return: true if current packets in flight (PIF) count is lower than > > + * the dynamic PIF limit, else false > > + */ > > +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp) > > +{ > > + u64 dpif_lim = tp->srtt_us >> 3; > > + /* Div by is_thin_min_itt_lim, the minimum allowed ITT > > +* (Inter-transmission time) in usecs. > > +*/ > > + do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound); > > + return tcp_packets_in_flight(tp) < dpif_lim; > > +} > > + > This is very strange : > > You are using a do_div() while both operands are 32bits. A regular > divide would be ok : > > u32 dpif_lim = (tp->srtt_us >> 3) / tp->thin_dpifl_itt_lower_bound; > > But then, you can avoid the divide by using a multiply, less expensive : > > return(u64)tcp_packets_in_flight(tp) * tp->thin_dpifl_itt_lower_bound > < > (tp->srtt_us >> 3); > You are of course correct. Will fix this and use multiply. Thanks.
[PATCH RFC net-next 1/2] tcp: Add DPIFL thin stream detection mechanism
The existing mechanism for detecting thin streams (tcp_stream_is_thin) is based on a static limit of less than 4 packets in flight. This treats streams differently depending on the connection's RTT, such that a stream on a high RTT link may never be considered thin, whereas the same application would produce a stream that would always be thin in a low RTT scenario (e.g. data center). By calculating a dynamic packets in flight limit (DPIFL), the thin stream detection will be independent of the RTT and treat streams equally based on the transmission pattern, i.e. the inter-transmission time (ITT). Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 8 include/linux/tcp.h| 6 ++ include/net/tcp.h | 20 net/ipv4/sysctl_net_ipv4.c | 9 + net/ipv4/tcp.c | 3 +++ 5 files changed, 46 insertions(+) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 85752c8..b841a76 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -700,6 +700,14 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_thin_dpifl_itt_lower_bound - INTEGER + Controls the lower bound for ITT (inter-transmission time) threshold + for when a stream is considered thin. The value is specified in + microseconds, and may not be lower than 10000 (10 ms). This threshold + is used to calculate a dynamic packets in flight limit (DPIFL) which + is used to classify whether a stream is thin. + Default: 10000 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/tcp.h b/include/linux/tcp.h index c906f45..fc885db 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -269,6 +269,12 @@ struct tcp_sock { struct sk_buff* lost_skb_hint; struct sk_buff *retransmit_skb_hint; + /* The limit used to identify when a stream is thin based on a minimum +* allowed inter-transmission time (ITT) in microseconds. This is used +* to dynamically calculate a max packets in flight limit (DPIFL). + */ + int thin_dpifl_itt_lower_bound; + /* OOO segments go in this list. Note that socket lock must be held, * as we do not use sk_buff_head lock. */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 4fc457b..6534836 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -215,6 +215,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* TCP thin-stream limits */ #define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */ +#define TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN 10000 /* Minimum lower bound is 10 ms (10000 usec) */ /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */ #define TCP_INIT_CWND 10 @@ -274,6 +275,7 @@ extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; +extern int sysctl_tcp_thin_dpifl_itt_lower_bound; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -1631,6 +1633,24 @@ static inline bool tcp_stream_is_thin(struct tcp_sock *tp) return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp); } +/** + * tcp_stream_is_thin_dpifl() - Tests if the stream is thin based on dynamic PIF + * limit + * @tp: the tcp_sock struct + * + * Return: true if current packets in flight (PIF) count is lower than + * the dynamic PIF limit, else false + */ +static inline bool tcp_stream_is_thin_dpifl(const struct tcp_sock *tp)
+{ + u64 dpif_lim = tp->srtt_us >> 3; + /* Div by is_thin_min_itt_lim, the minimum allowed ITT +* (Inter-transmission time) in usecs. +*/ + do_div(dpif_lim, tp->thin_dpifl_itt_lower_bound); + return tcp_packets_in_flight(tp) < dpif_lim; +} + /* /proc */ enum tcp_seq_states { TCP_SEQ_STATE_LISTENING, diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 25300c5..917fdde 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -42,6 +42,7 @@ static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int ip_ping_group_range_min[] = {
[PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
This is a request for comments. Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games and remote desktop, produce traffic with thin-stream characteristics, characterized by small packets and a relatively high ITT. By bundling already sent data in packets with new data, RDB alleviates head-of-line blocking by reducing the need to retransmit data segments when packets are lost. RDB is a continuation of the work on latency improvements for TCP in Linux, previously resulting in two thin-stream mechanisms in the Linux kernel (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). The RDB implementation has been thoroughly tested, and shows significant latency reductions when packet loss occurs[1]. The tests show that, by imposing restrictions on the bundling rate, it can be made not to negatively affect competing traffic in an unfair manner. Note: Current patch set depends on a recently submitted patch for tcp_skb_cb (tcp: refactor struct tcp_skb_cb: http://patchwork.ozlabs.org/patch/510674) These patches have been tested with a set of packetdrill scripts located at https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb (The tests require patching packetdrill with a new socket option: https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) Detailed info about the RDB mechanism can be found at http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as in the paper "Latency and Fairness Trade-Off for Thin Streams using Redundant Data Bundling in TCP"[2].
[1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf Bendik Rønning Opstad (2): tcp: Add DPIFL thin stream detection mechanism tcp: Add Redundant Data Bundling (RDB) Documentation/networking/ip-sysctl.txt | 23 +++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 9 +- include/net/tcp.h | 34 include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 35 net/ipv4/tcp.c | 19 ++- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 281 + 12 files changed, 415 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c -- 1.9.1
[PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
RDB is a mechanism that enables a TCP sender to bundle redundant (already sent) data with TCP packets containing new data. By bundling (retransmitting) already sent data with each TCP packet containing new data, the connection will be more resistant to sporadic packet loss which reduces the application layer latency significantly in congested scenarios. The main functionality added: o Loss detection of hidden loss events: When bundling redundant data with each packet, packet loss can be hidden from the TCP engine due to lack of dupACKs. This is because the loss is "repaired" by the redundant data in the packet coming after the lost packet. Based on incoming ACKs, such hidden loss events are detected, and CWR state is entered. o When packets are scheduled for transmission, RDB replaces the SKB to be sent with a modified SKB containing the redundant data of previously sent data segments from the TCP output queue. o RDB will only be used for streams classified as thin by the function tcp_stream_is_thin_dpifl(). This enforces a lower bound on the ITT for streams that may benefit from RDB, controlled by the sysctl variable tcp_thin_dpifl_itt_lower_bound. RDB is enabled on a connection with the socket option TCP_RDB, or on all new connections by setting the sysctl variable tcp_rdb=1. 
Cc: Andreas Petlund <apetl...@simula.no> Cc: Carsten Griwodz <gr...@simula.no> Cc: Pål Halvorsen <pa...@simula.no> Cc: Jonas Markussen <jona...@ifi.uio.no> Cc: Kristian Evensen <kristian.even...@gmail.com> Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com> --- Documentation/networking/ip-sysctl.txt | 15 ++ include/linux/skbuff.h | 1 + include/linux/tcp.h| 3 +- include/net/tcp.h | 14 ++ include/uapi/linux/tcp.h | 1 + net/core/skbuff.c | 3 +- net/ipv4/Makefile | 3 +- net/ipv4/sysctl_net_ipv4.c | 26 +++ net/ipv4/tcp.c | 16 +- net/ipv4/tcp_input.c | 3 + net/ipv4/tcp_output.c | 11 +- net/ipv4/tcp_rdb.c | 281 + 12 files changed, 369 insertions(+), 8 deletions(-) create mode 100644 net/ipv4/tcp_rdb.c diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index b841a76..740e6a3 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -708,6 +708,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER is used to classify whether a stream is thin. Default: 1 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_skbs - INTEGER + Enable restriction on how many previous SKBs in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. 
TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 24f4dfd..3572d21 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2809,6 +2809,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index fc885db..f38b889 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -202,9 +202,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1, frto: 1;/
[PATCH v2 net-next] tcp: Fix CWV being too strict on thin streams
Application limited streams such as thin streams, that transmit small
amounts of payload in relatively few packets per RTT, can be prevented
from growing the CWND when in congestion avoidance. This leads to
increased sojourn times for data segments in streams that often transmit
time-dependent data.

Currently, a connection is considered CWND limited only after having
successfully transmitted at least one packet with new data, while at the
same time failing to transmit some unsent data from the output queue
because the CWND is full. Applications that produce small amounts of
data may be left in a state where they are never considered to be CWND
limited, because all unsent data is successfully transmitted each time
an incoming ACK opens up for more data to be transmitted in the send
window.

Fix by always testing whether the CWND is fully used after successful
packet transmissions, such that a connection is considered CWND limited
whenever the CWND has been filled. This is the correct behavior as
specified in RFC 2861 (section 3.1).

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Cc: Mads Johannessen <mads...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
Changes in v2:
 - Instead of updating tp->is_cwnd_limited after failing to send any
   packets, test whether CWND is full after successful packet
   transmissions.
 - Updated commit message according to changes.

 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4cd0b50..57a586f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	/* Ok, it looks like it is advisable to defer.
	 */

-	if (cong_win < send_win && cong_win < skb->len)
+	if (cong_win < send_win && cong_win <= skb->len)
 		*is_cwnd_limited = true;

 	return true;
@@ -2060,7 +2060,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

 		cwnd_quota = tcp_cwnd_test(tp, skb);
 		if (!cwnd_quota) {
-			is_cwnd_limited = true;
 			if (push_one == 2)
 				/* Force out a loss probe pkt. */
 				cwnd_quota = 1;
@@ -2142,6 +2141,7 @@ repair:
 	/* Send one loss probe per tail loss episode. */
 	if (push_one != 2)
 		tcp_schedule_loss_probe(sk);
+	is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
 	tcp_cwnd_validate(sk, is_cwnd_limited);
 	return false;
 }
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
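The bookkeeping change above can be illustrated with a small user-space toy model (Python, not kernel code; the function and variable names are simplified stand-ins for tcp_write_xmit() and its state). It shows why the old logic never marks an application-limited thin stream as CWND limited when its unsent data always fits in the window, while the v2 check does:

```python
# Toy model of the tcp_write_xmit() bookkeeping discussed above.
# Not kernel code: packets are unit-sized and the queue is a counter.

def write_xmit(unsent, in_flight, cwnd, fixed=False):
    """Return (packets_sent, is_cwnd_limited) for one transmit pass."""
    is_cwnd_limited = False
    sent = 0
    while unsent > 0:
        cwnd_quota = cwnd - (in_flight + sent)
        if cwnd_quota <= 0:
            # Old code only learned it was CWND limited here, i.e. when
            # it still had unsent data but no quota left.
            is_cwnd_limited = True
            break
        unsent -= 1
        sent += 1
    if fixed:
        # v2 fix: always test whether the CWND is fully used after
        # transmitting, even if all queued data fit in the window.
        is_cwnd_limited |= (in_flight + sent) >= cwnd
    return sent, is_cwnd_limited

# Thin stream: cwnd=2, one packet in flight, one small segment queued.
# The segment fits, so the old logic reports "not CWND limited" even
# though the window is now completely full.
print(write_xmit(1, 1, 2, fixed=False))  # (1, False)
print(write_xmit(1, 1, 2, fixed=True))   # (1, True)
```

A bulk sender with more queued data than quota is flagged by both variants, which is why the problem only shows up for application-limited streams.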
Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams
> Thanks for this report!

Thank you for answering!

> When you say "CWND is reduced due to loss", are you talking about RTO
> or Fast Recovery? Do you have any traces you can share that illustrate
> this issue?

The problem is not related to loss recovery, but only occurs in
congestion avoidance state. This is because tcp_is_cwnd_limited(),
called from the congestion control's ".cong_avoid" hook, will only use
tp->is_cwnd_limited when in congestion avoidance.

I've made two traces to compare the effect with and without the patch:
http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/

To ensure that the results are comparable, the traces are produced with
an extra patch applied (shown below) such that both connections are in
congestion avoidance from the beginning.

> Have you verified that this patch fixes the issue you identified? I
> think the fix may be in the wrong place for that scenario, or at least
> incomplete.

Yes, we have tested this and verified that the patch solves the issue
in the test scenarios we've run. As always (in networking) it can be
difficult to reproduce the issue in a consistent manner, so I will
describe the setup we have used.

In our testbed we have a sender host, a bridge and a receiver. The
bridge is configured with 75 ms delay in each direction, giving an RTT
of 150 ms.

On bridge:
tc qdisc add dev eth0 handle 1:0 root netem delay 75ms
tc qdisc add dev eth1 handle 1:0 root netem delay 75ms

We have used the tool streamzero
(https://bitbucket.org/bendikro/streamzero) to produce the thin-stream
traffic. An example where the sender opens a TCP connection and
performs write calls to the socket every 10 ms with 100 bytes per call
for 60 seconds:

On receiver host (10.0.0.22):
./streamzero_srv -p 5010 -A

On sender host:
./streamzero_client -s 10.0.0.22 -p 5010 -v3 -A4 -d 60 -I i:10,S:100

streamzero_client will by default disable Nagle (TCP_NODELAY=1).
Adding loss for a few seconds in netem causes the CWND and ssthresh to
eventually drop to a low value, e.g. 2.

On bridge:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay 75ms loss 4%

Removing loss after a few seconds:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay 75ms

From this point on, the CWND will not grow any further, as shown by the
print statement when applying this patch:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..947eb57 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2135,6 +2135,8 @@ repair:
 			break;
 	}

+	printk("CWND: %d, SSTHRESH: %d, PKTS_OUT: %d\n", tp->snd_cwnd, tp->snd_ssthresh, tp->packets_out);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;

The following patch avoids the need to induce loss by setting the
connection in congestion avoidance from the beginning:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a62e9c7..206b32f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5394,6 +5394,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)

 	tcp_init_metrics(sk);
+	tp->snd_cwnd = tp->snd_ssthresh = 2;
 	tcp_init_congestion_control(sk);

 	/* Prevent spurious tcp_cwnd_restart() on first data

The traces linked to above are run with this patch applied.

> In the scenario you describe, you say "all the buffered data in the
> output queue was sent within the current CWND", which means that there
> is no unsent data, so in tcp_write_xmit() the call to tcp_send_head()
> would return NULL, so we would not enter the while loop, and could not
> set is_cwnd_limited to true.

I'll describe two example scenarios in detail. In both scenarios we are
in congestion avoidance after experiencing loss. Nagle is disabled.

Scenario 1:
Current CWND is 2 and there is currently one packet in flight. The
output queue contains one SKB with unsent data (payload <= 1 * MSS).
In tcp_write_xmit(), tcp_send_head() will return the SKB with unsent
data, and tcp_cwnd_test() will set cwnd_quota to 1, which means that
is_cwnd_limited will remain false. After sending the packet, sent_pkts
will be set to 1, and the while loop will break because tcp_send_head()
returns NULL. tcp_cwnd_validate() will be called with
is_cwnd_limited=false.

When a new data chunk is put on the output queue, and tcp_write_xmit()
is called, tcp_send_head() will return the SKB with unsent data.
tcp_cwnd_test() will now set cwnd_quota to 0 because packets in flight
== CWND. is_cwnd_limited will correctly be set to true, however, since
no packets were sent, tcp_cwnd_validate() will never be called.

(Calling tcp_cwnd_validate() if sent_pkts == 0 will not solve the
problem in this scenario, as it will not enter the first if test in
tcp_cwnd_validate() where tp->is_cwnd_limited is updated.)

Scenario 2:
CWND is 2 and there is currently one packet in flight. The output
[PATCH net-next] tcp: Fix CWV being too strict on thin streams
Application limited streams such as thin streams, that transmit small
amounts of payload in relatively few packets per RTT, are prevented
from growing the CWND after experiencing loss. This leads to increased
sojourn times for data segments in streams that often transmit
time-dependent data.

After the CWND is reduced due to loss, and an ACK has made room in the
send window for more packets to be transmitted, the CWND will not grow
unless there is more unsent data buffered in the output queue than the
CWND permits to be sent. That is because tcp_cwnd_validate(), which
updates tp->is_cwnd_limited, is only called in tcp_write_xmit() when at
least one packet with new data has been sent.

However, if all the buffered data in the output queue was sent within
the current CWND, is_cwnd_limited will remain false even when there is
no more room in the CWND. While the CWND is fully utilized, any new
data put on the output queue will be held back (i.e. limited by the
CWND), but tp->is_cwnd_limited will not be updated as no packets were
transmitted.

Fix by updating tp->is_cwnd_limited if no packets are sent due to the
CWND being fully utilized.

Cc: Andreas Petlund <apetl...@simula.no>
Cc: Carsten Griwodz <gr...@simula.no>
Cc: Jonas Markussen <jona...@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kenne...@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+ker...@gmail.com>
---
 net/ipv4/tcp_output.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..7096b8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,6 +2145,11 @@ repair:
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
 	}
+	/* We are prevented from transmitting because we are limited
+	 * by the CWND, so update tcp sock with correct value.
+	 */
+	if (is_cwnd_limited)
+		tp->is_cwnd_limited = is_cwnd_limited;

 	return !tp->packets_out && tcp_send_head(sk);
 }
--
1.9.1