Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-17 Thread Tom Herbert
On Mon, Oct 17, 2016 at 8:35 PM, Lawrence Brakmo  wrote:
> Yuchung and Eric, thank you for your comments.
>
> It looks like I need to think more about this patch. I was trying
> to reduce the likelihood of reordering (which seems even more
> important based on Eric's comment on pacing), but it seems like
> the only way to prevent reordering is to only re-hash after an RTO
> or when there are no packets in flight (which may not occur).
>
Sounds like that should be the same condition as when we set ooo_okay?
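For reference, the condition alluded to can be mimicked by deferring the rehash until nothing is in flight, which is when rerouting cannot reorder the flow. A minimal userspace sketch of that idea (all names here, like `rehash_pending` and `packets_out`, are illustrative, not the kernel's):

```c
#include <assert.h>

/* Sketch of a deferred rehash gated on an empty pipe, mirroring the
 * ooo_okay idea: remember that a rehash is wanted, and apply it only
 * once no segments are in flight, so the path change cannot reorder
 * packets already in the network. Illustrative only. */
struct flow {
	unsigned int txhash;
	unsigned int packets_out;	/* segments currently in flight */
	int rehash_pending;
};

static void request_rehash(struct flow *f)
{
	f->rehash_pending = 1;		/* defer; don't rehash mid-flight */
}

static void on_ack(struct flow *f, unsigned int newly_acked)
{
	if (newly_acked > f->packets_out)
		newly_acked = f->packets_out;
	f->packets_out -= newly_acked;
	if (f->rehash_pending && f->packets_out == 0) {
		/* pipe is empty: safe to pick any new hash */
		f->txhash = f->txhash * 2654435761u + 1;
		f->rehash_pending = 0;
	}
}
```

The patch under discussion instead rehashes around retransmits, which is exactly why reordering is the concern here.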


Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-17 Thread Lawrence Brakmo
Yuchung and Eric, thank you for your comments.

It looks like I need to think more about this patch. I was trying
to reduce the likelihood of reordering (which seems even more
important based on Eric's comment on pacing), but it seems like
the only way to prevent reordering is to only re-hash after an RTO
or when there are no packets in flight (which may not occur).



Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-12 Thread Eric Dumazet
On Tue, 2016-10-11 at 20:56 -0700, Yuchung Cheng wrote:

> I thought more about this patch on my way home and have more
> questions: why do we exclude RTO retransmission specifically? also
> when we rehash, we'll introduce reordering either in recovery or after
> recovery, as some TCP CC like BBR would continue sending regardless,
> so starting in tcp_ack() with tp->txhash_want does not really prevent
> causing more reordering.

Note that changing txhash during a non-RTO retransmit is going to break
pacing on a bonding setup, since the change in txhash will likely select
a different slave where MQ+FQ are the qdiscs in place.
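A rough illustration of why: with MQ, the transmit queue (and hence the bonding slave with its own FQ instance and pacing state) is derived from the skb hash, roughly as below (a simplification in the spirit of the kernel's reciprocal-scale queue selection, not the actual `skb_tx_hash()` code):

```c
#include <assert.h>

/* Simplified hash-based tx queue selection: the queue index is a
 * scaled-down flow hash, so a new txhash generally lands the flow on
 * a different queue/slave, losing the FQ pacing state built up on
 * the old one. */
static unsigned int pick_txq(unsigned int txhash, unsigned int num_queues)
{
	return (unsigned int)(((unsigned long long)txhash * num_queues) >> 32);
}
```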




Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-11 Thread Yuchung Cheng
I thought more about this patch on my way home and have more
questions: why do we exclude RTO retransmission specifically? also
when we rehash, we'll introduce reordering either in recovery or after
recovery, as some TCP CC like BBR would continue sending regardless,
so starting in tcp_ack() with tp->txhash_want does not really prevent
causing more reordering.

Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-11 Thread Yuchung Cheng
On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo  wrote:

Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-11 Thread Lawrence Brakmo
Yuchung, thank you for your comments. Responses inline.

On 10/11/16, 12:49 PM, "Yuchung Cheng"  wrote:

>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo  wrote:
>>
>> The purpose of this patch is to help balance flows across paths. A new
>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
>> A probability is used in order to control how many flows are moved
>> during a congestion event and prevent the congested path from becoming
>> underutilized (which could occur if too many flows leave the current
>> path). Txhash changes may be delayed in order to decrease the likelihood
>> that they will trigger retransmits due to too much reordering.
>>
>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>> txhash changes. The idea is to decrease the likelihood of going back
>> to a broken path. That is, we don't want flow balancing to trigger
>> changes to broken paths. The drawback is that flow balancing does
>> not work as well. If the sysctl is greater than zero, then we always
>> do flow balancing, even after RTOs.
>>
>> Tested with packetdrill tests (for correctness) and performance
>> experiments with 2 and 3 paths. Performance experiments looked at
>> aggregate goodput and fairness. For each run, we looked at the ratio of
>> the goodputs for the fastest and slowest flows. These were averaged for
>> all the runs. A fairness of 1 means all flows had the same goodput, a
>> fairness of 2 means the fastest flow was twice as fast as the slowest
>> flow.
>>
>> The setup for the performance experiments was 4 or 5 servers in a rack,
>> 10G links. I tested various probabilities, but 20 seemed to have the
>> best tradeoff for my setup (small RTTs).
>>
>>                   --- node1 ---
>> sender --- switch --- node2 --- switch --- receiver
>>                   --- node3 ---
>>
>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
>> node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
>> 1.9G with flow balancing, due to 2 flows ending up on one link and either
>> not moving or taking some time to move. Fairness was 1 in all cases.
>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
>> flow balancing increased fairness.
>>
>> Scenario 2: One sender to one receiver, through 3 routes (node1 ...
>> node3). With 6 or 16 flows the goodput was the same for all, but
>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>> case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is,
>> prob=20,mode=1 improved average and worst case fairness.
>I am wondering if we can build a better API with the routing layer to
>implement this type of feature, instead of scattering the tx_rehashing
>logic throughout TCP. For example, we call dst_negative_advice on TCP
>write timeouts.

Not sure. The route is not necessarily bad; it may be temporarily congested,
or all routes may be congested. If all we want to do is change the txhash
(unlike dst_negative_advice), then calling a tx_rehashing function may
be the appropriate call.
 
>
>On the patch itself, it seems aggressive to (attempt to) rehash every
>post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to
>identify post-RTO retransmission directly.

Thanks, I will add the test.

>
>Is this an implementation of Flow Bender?
>http://dl.acm.org/citation.cfm?id=2674985

Part of Flow Bender, although there are also some similarities to flowlet
switching.


Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-11 Thread Yuchung Cheng
On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo  wrote:
>
> The purpose of this patch is to help balance flows across paths. A new
> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
> A probability is used in order to control how many flows are moved
> during a congestion event and prevent the congested path from becoming
> underutilized (which could occur if too many flows leave the current
> path). Txhash changes may be delayed in order to decrease the likelihood
> that they will trigger retransmits due to too much reordering.
>
> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
> txhash changes. The idea is to decrease the likelihood of going back
> to a broken path. That is, we don't want flow balancing to trigger
> changes to broken paths. The drawback is that flow balancing does
> not work as well. If the sysctl is greater than zero, then we always
> do flow balancing, even after RTOs.
>
> Tested with packetdrill tests (for correctness) and performance
> experiments with 2 and 3 paths. Performance experiments looked at
> aggregate goodput and fairness. For each run, we looked at the ratio of
> the goodputs for the fastest and slowest flows. These were averaged for
> all the runs. A fairness of 1 means all flows had the same goodput, a
> fairness of 2 means the fastest flow was twice as fast as the slowest
> flow.
>
> The setup for the performance experiments was 4 or 5 servers in a rack,
> 10G links. I tested various probabilities, but 20 seemed to have the
> best tradeoff for my setup (small RTTs).
>
>                   --- node1 ---
> sender --- switch --- node2 --- switch --- receiver
>                   --- node3 ---
>
> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
> node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
> 1.9G with flow balancing, due to 2 flows ending up on one link and either
> not moving or taking some time to move. Fairness was 1 in all cases.
> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
> flow balancing increased fairness.
>
> Scenario 2: One sender to one receiver, through 3 routes (node1 ...
> node3). With 6 or 16 flows the goodput was the same for all, but
> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
> case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is,
> prob=20,mode=1 improved average and worst case fairness.
I am wondering if we can build a better API with the routing layer to
implement this type of feature, instead of scattering the tx_rehashing
logic throughout TCP. For example, we call dst_negative_advice on TCP
write timeouts.

On the patch itself, it seems aggressive to (attempt to) rehash every
post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to
identify post-RTO retransmission directly.

Is this an implementation of Flow Bender?
http://dl.acm.org/citation.cfm?id=2674985


[PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-10 Thread Lawrence Brakmo
The purpose of this patch is to help balance flows across paths. A new
sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
A probability is used in order to control how many flows are moved
during a congestion event and prevent the congested path from becoming
underutilized (which could occur if too many flows leave the current
path). Txhash changes may be delayed in order to decrease the likelihood
that they will trigger retransmits due to too much reordering.

Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
txhash changes. The idea is to decrease the likelihood of going back
to a broken path. That is, we don't want flow balancing to trigger
changes to broken paths. The drawback is that flow balancing does
not work as well. If the sysctl is greater than zero, then we always
do flow balancing, even after RTOs.
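The decision the two sysctls describe can be sketched in plain C (the names `txhash_prob`, `txhash_mode`, and `rto_seen` are illustrative stand-ins, not the patch's actual identifiers):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch of the rehash policy described above:
 * - an RTO always rehashes;
 * - a non-RTO retransmit rehashes with probability txhash_prob%,
 *   but with mode 0 only if the flow has never seen an RTO. */
struct flow_state {
	int rto_seen;		/* flow has experienced an RTO */
};

static int should_rehash(const struct flow_state *f, int is_rto,
			 int txhash_prob, int txhash_mode)
{
	if (is_rto)
		return 1;
	if (txhash_mode == 0 && f->rto_seen)
		return 0;	/* mode 0: don't rebalance after an RTO */
	return (rand() % 100) < txhash_prob; /* move ~prob% of flows */
}
```

With prob=20, roughly one in five flows hitting a congestion event picks a new path, which is the knob the experiments tune.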

Tested with packetdrill tests (for correctness) and performance
experiments with 2 and 3 paths. Performance experiments looked at
aggregate goodput and fairness. For each run, we looked at the ratio of
the goodputs for the fastest and slowest flows. These were averaged for
all the runs. A fairness of 1 means all flows had the same goodput, a
fairness of 2 means the fastest flow was twice as fast as the slowest
flow.
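In other words, the fairness figure reported throughout is simply max goodput divided by min goodput across the flows; a minimal helper to make the metric concrete:

```c
#include <assert.h>

/* Fairness as defined above: fastest flow's goodput over the
 * slowest's. 1.0 = perfectly even; 2.0 = fastest got twice the
 * goodput of the slowest. */
static double fairness(const double *goodput, int n)
{
	double min = goodput[0], max = goodput[0];

	for (int i = 1; i < n; i++) {
		if (goodput[i] < min)
			min = goodput[i];
		if (goodput[i] > max)
			max = goodput[i];
	}
	return max / min;
}
```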

The setup for the performance experiments was 4 or 5 servers in a rack,
10G links. I tested various probabilities, but 20 seemed to have the
best tradeoff for my setup (small RTTs).

                  --- node1 ---
sender --- switch --- node2 --- switch --- receiver
                  --- node3 ---

Scenario 1: One sender sends to one receiver through 2 routes (node1 or
node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
1.9G with flow balancing, due to 2 flows ending up on one link and either
not moving or taking some time to move. Fairness was 1 in all cases.
For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
flow balancing increased fairness.

Scenario 2: One sender to one receiver, through 3 routes (node1 ...
node3). With 6 or 16 flows the goodput was the same for all, but
fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is,
prob=20,mode=1 improved average and worst case fairness.

Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
the packets. With 7 flows, goodput was the same 1.1G, but fairness was
1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
balancing, which does more re-routes, is less fair.

Signed-off-by: Lawrence Brakmo 
---
 Documentation/networking/ip-sysctl.txt | 15 +++
 include/linux/tcp.h|  4 +++-
 include/net/tcp.h  |  2 ++
 net/ipv4/sysctl_net_ipv4.c | 18 ++
 net/ipv4/tcp_input.c   | 10 ++
 net/ipv4/tcp_output.c  | 23 ++-
 net/ipv4/tcp_timer.c   |  4 
 7 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 3db8c67..87a984c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
if paths are using per packet load balancing (like bonding rr mode)
Default: 300
 
+tcp_retrans_txhash_mode - INTEGER
+   If zero, disable txhash recalculation due to non-RTO retransmissions
+   after an RTO. The idea is that broken paths will trigger an RTO and
+   we don't want to go back to that path due to standard retransmissions
+   (flow balancing). The drawback is that balancing is less robust.
+   If greater than zero, txhash can always (probabilistically) be
+   recalculated after non-RTO retransmissions.
+
+tcp_retrans_txhash_prob - INTEGER
+   Probability [0 to 100] that we will recalculate txhash when a
+   packet is resent not due to RTO (for RTO txhash is always recalculated).
+   The recalculation of the txhash may be delayed to decrease the
+   likelihood that reordering will trigger retransmissions.
+   The purpose is to help balance the flows among the possible paths.
+
 tcp_retrans_collapse - BOOLEAN
Bug-to-bug compatibility with some broken printers.
On retransmit try to send bigger packets to work around bugs in
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a17ae7b..e0e3b7d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -214,7 +214,9 @@ struct tcp_sock {
} rack;
u16 advmss; /* Advertised