Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
On Mon, Oct 17, 2016 at 8:35 PM, Lawrence Brakmo wrote: > Yuchung and Eric, thank you for your comments. > > It looks like I need to think more about this patch. I was trying > to reduce the likelihood of reordering (which seems even more > important based on Eric's comment on pacing), but it seems like > the only way to prevent reordering is to only re-hash after an RTO > or when there are no packets in flight (which may not occur). > Sounds like that should be the same condition as when we set ooo_okay? > > On 10/11/16, 8:56 PM, "Yuchung Cheng" wrote: > >>On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng wrote: >>> On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo wrote: Yuchung, thank you for your comments. Responses inline. On 10/11/16, 12:49 PM, "Yuchung Cheng" wrote: >On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo wrote: >> >> The purpose of this patch is to help balance flows across paths. A >>new >> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) >>that >> the txhash (IPv6 flowlabel) will be changed after a non-RTO >>retransmit. >> A probability is used in order to control how many flows are moved >> during a congestion event and prevent the congested path from >>becoming >> underutilized (which could occur if too many flows leave the current >> path). Txhash changes may be delayed in order to decrease the >>likelihood >> that it will trigger retransmits due to too much reordering. >> >> Another sysctl "tcp_retrans_txhash_mode" determines the behavior >>after >> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger >> txhash changes. The idea is to decrease the likelihood of going back >> to a broken path. That is, we don't want flow balancing to trigger >> changes to broken paths. The drawback is that flow balancing does >> not work as well. If the sysctl is greater than 0, then we always >> do flow balancing, even after RTOs. 
>> >> Tested with packetdrill tests (for correctness) and performance >> experiments with 2 and 3 paths. Performance experiments looked at >> aggregate goodput and fairness. For each run, we looked at the ratio >>of >> the goodputs for the fastest and slowest flows. These were averaged >>for >> all the runs. A fairness of 1 means all flows had the same goodput, a >> fairness of 2 means the fastest flow was twice as fast as the slowest >> flow. >> >> The setup for the performance experiments was 4 or 5 servers in a >>rack, >> 10G links. I tested various probabilities, but 20 seemed to have the >> best tradeoff for my setup (small RTTs). >> >> --- node1 - >> sender --- switch --- node2 - switch receiver >> --- node3 - >> >> Scenario 1: One sender sends to one receiver through 2 routes (node1 >>or >> node 2). The output from node1 and node2 is 1G (1gbit/sec). With >>only 2 >> flows, without flow balancing (prob=0) the average goodput is 1.6G >>vs. >> 1.9G with flow balancing due to 2 flows ending up in one link and >>either >> not moving and taking some time to move. Fairness was 1 in all cases. >> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or >>1.2 >> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is, >> flow balancing increased fairness. >> >> Scenario 2: One sender to one receiver, through 3 routes (node1,... >> node2). With 6 or 16 flows the goodput was the same for all, but >> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst >> case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That >>is, >> prob=20,mode=1 improved average and worst case fairness. >I am wondering if we can build a better API with the routing layer to >implement this type of feature, instead of scattering the tx_rehashing >logic in TCP. For example, we call dst_negative_advice on TCP >write timeouts. Not sure. The route is not necessarily bad; it may be temporarily congested, or they may all be congested. 
If all we want to do is change the txhash (unlike dst_negative_advice), then calling a tx_rehashing function may be the appropriate call. > >On the patch itself, it seems aggressive to (attempt to) rehash every >post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to >identify post-RTO retransmission directly. Thanks, I will add the test. > >is this an implementation of the Flow Bender ? >http://dl.acm.org/citation.cfm?id=2674985
Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
Yuchung and Eric, thank you for your comments. It looks like I need to think more about this patch. I was trying to reduce the likelihood of reordering (which seems even more important based on Eric's comment on pacing), but it seems like the only way to prevent reordering is to only re-hash after an RTO or when there are no packets in flight (which may not occur). On 10/11/16, 8:56 PM, "Yuchung Cheng" wrote: >On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng wrote: >> On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo wrote: >>> Yuchung, thank you for your comments. Responses inline. >>> >>> On 10/11/16, 12:49 PM, "Yuchung Cheng" wrote: >>> On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo wrote: > > The purpose of this patch is to help balance flows across paths. A >new > sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) >that > the txhash (IPv6 flowlabel) will be changed after a non-RTO >retransmit. > A probability is used in order to control how many flows are moved > during a congestion event and prevent the congested path from >becoming > underutilized (which could occur if too many flows leave the current > path). Txhash changes may be delayed in order to decrease the >likelihood > that it will trigger retransmits due to too much reordering. > > Another sysctl "tcp_retrans_txhash_mode" determines the behavior >after > RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger > txhash changes. The idea is to decrease the likelihood of going back > to a broken path. That is, we don't want flow balancing to trigger > changes to broken paths. The drawback is that flow balancing does > not work as well. If the sysctl is greater than 0, then we always > do flow balancing, even after RTOs. > > Tested with packetdrill tests (for correctness) and performance > experiments with 2 and 3 paths. Performance experiments looked at > aggregate goodput and fairness. 
For each run, we looked at the ratio >of > the goodputs for the fastest and slowest flows. These were averaged >for > all the runs. A fairness of 1 means all flows had the same goodput, a > fairness of 2 means the fastest flow was twice as fast as the slowest > flow. > > The setup for the performance experiments was 4 or 5 servers in a >rack, > 10G links. I tested various probabilities, but 20 seemed to have the > best tradeoff for my setup (small RTTs). > > --- node1 - > sender --- switch --- node2 - switch receiver > --- node3 - > > Scenario 1: One sender sends to one receiver through 2 routes (node1 >or > node 2). The output from node1 and node2 is 1G (1gbit/sec). With >only 2 > flows, without flow balancing (prob=0) the average goodput is 1.6G >vs. > 1.9G with flow balancing due to 2 flows ending up in one link and >either > not moving and taking some time to move. Fairness was 1 in all cases. > For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or >1.2 > for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is, > flow balancing increased fairness. > > Scenario 2: One sender to one receiver, through 3 routes (node1,... > node2). With 6 or 16 flows the goodput was the same for all, but > fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst > case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That >is, > prob=20,mode=1 improved average and worst case fairness. I am wondering if we can build a better API with the routing layer to implement this type of feature, instead of scattering the tx_rehashing logic in TCP. For example, we call dst_negative_advice on TCP write timeouts. >>> >>> Not sure. The route is not necessarily bad; it may be temporarily >>>congested >>> or they may all be congested. If all we want to do is change the txhash >>> (unlike dst_negative_advice), then calling a tx_rehashing function may >>> be the appropriate call. 
>>> On the patch itself, it seems aggressive to (attempt to) rehash every post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to identify post-RTO retransmission directly. >>> >>> Thanks, I will add the test. >>> is this an implementation of the Flow Bender ? http://dl.acm.org/citation.cfm?id=2674985 >>> >>> Part of flow bender, although there are also some similarities to >>>flowlet >>> switching. >>> > > Scenario 3: One sender to one receiver, 2 routes, one route drops >50% of > the packets. With 7 flows, goodput wa
Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
On Tue, 2016-10-11 at 20:56 -0700, Yuchung Cheng wrote: > I thought more about this patch on my way home and have more > questions: why do we exclude RTO retransmission specifically? also > when we rehash, we'll introduce reordering either in recovery or after > recovery, as some TCP CC like bbr would continue sending regardless, > so starting in tcp_ack() with tp->txhash_want does not really prevent > causing more reordering. Note that changing txhash during a non-RTO retransmit is going to break pacing on a bonding setup, since the change in txhash will likely select a different slave, where MQ+FQ are the qdiscs in place.
Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
On Tue, Oct 11, 2016 at 6:01 PM, Yuchung Cheng wrote: > On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo wrote: >> Yuchung, thank you for your comments. Responses inline. >> >> On 10/11/16, 12:49 PM, "Yuchung Cheng" wrote: >> >>>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo wrote: The purpose of this patch is to help balance flows across paths. A new sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit. A probability is used in order to control how many flows are moved during a congestion event and prevent the congested path from becoming underutilized (which could occur if too many flows leave the current path). Txhash changes may be delayed in order to decrease the likelihood that it will trigger retransmits due to too much reordering. Another sysctl "tcp_retrans_txhash_mode" determines the behavior after RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger txhash changes. The idea is to decrease the likelihood of going back to a broken path. That is, we don't want flow balancing to trigger changes to broken paths. The drawback is that flow balancing does not work as well. If the sysctl is greater than 0, then we always do flow balancing, even after RTOs. Tested with packetdrill tests (for correctness) and performance experiments with 2 and 3 paths. Performance experiments looked at aggregate goodput and fairness. For each run, we looked at the ratio of the goodputs for the fastest and slowest flows. These were averaged for all the runs. A fairness of 1 means all flows had the same goodput, a fairness of 2 means the fastest flow was twice as fast as the slowest flow. The setup for the performance experiments was 4 or 5 servers in a rack, 10G links. I tested various probabilities, but 20 seemed to have the best tradeoff for my setup (small RTTs). 
--- node1 - sender --- switch --- node2 - switch receiver --- node3 - Scenario 1: One sender sends to one receiver through 2 routes (node1 or node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2 flows, without flow balancing (prob=0) the average goodput is 1.6G vs. 1.9G with flow balancing due to 2 flows ending up in one link and either not moving and taking some time to move. Fairness was 1 in all cases. For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2 for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is, flow balancing increased fairness. Scenario 2: One sender to one receiver, through 3 routes (node1,... node2). With 6 or 16 flows the goodput was the same for all, but fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is, prob=20,mode=1 improved average and worst case fairness. >>>I am wondering if we can build a better API with the routing layer to >>>implement this type of feature, instead of scattering the tx_rehashing >>>logic in TCP. For example, we call dst_negative_advice on TCP >>>write timeouts. >> >> Not sure. The route is not necessarily bad; it may be temporarily congested >> or they may all be congested. If all we want to do is change the txhash >> (unlike dst_negative_advice), then calling a tx_rehashing function may >> be the appropriate call. >> >>> >>>On the patch itself, it seems aggressive to (attempt to) rehash every >>>post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to >>>identify post-RTO retransmission directly. >> >> Thanks, I will add the test. >> >>> >>>is this an implementation of the Flow Bender ? 
>>>http://dl.acm.org/citation.cfm?id=2674985 >> >> Part of flow bender, although there are also some similarities to flowlet >> switching. >> >>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of the packets. With 7 flows, goodput was the same 1.1G, but fairness was 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then balancing, which does more re-routes, is less fair. Signed-off-by: Lawrence Brakmo --- Documentation/networking/ip-sysctl.txt | 15 +++ include/linux/tcp.h| 4 +++- include/net/tcp.h | 2 ++ net/ipv4/sysctl_net_ipv4.c | 18 ++ net/ipv4/tcp_input.c | 10 ++ net/ipv4/tcp_output.c | 23 ++- net/
Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
On Tue, Oct 11, 2016 at 2:08 PM, Lawrence Brakmo wrote: > Yuchung, thank you for your comments. Responses inline. > > On 10/11/16, 12:49 PM, "Yuchung Cheng" wrote: > >>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo wrote: >>> >>> The purpose of this patch is to help balance flows across paths. A new >>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that >>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit. >>> A probability is used in order to control how many flows are moved >>> during a congestion event and prevent the congested path from becoming >>> underutilized (which could occur if too many flows leave the current >>> path). Txhash changes may be delayed in order to decrease the likelihood >>> that it will trigger retransmits due to too much reordering. >>> >>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after >>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger >>> txhash changes. The idea is to decrease the likelihood of going back >>> to a broken path. That is, we don't want flow balancing to trigger >>> changes to broken paths. The drawback is that flow balancing does >>> not work as well. If the sysctl is greater than 0, then we always >>> do flow balancing, even after RTOs. >>> >>> Tested with packetdrill tests (for correctness) and performance >>> experiments with 2 and 3 paths. Performance experiments looked at >>> aggregate goodput and fairness. For each run, we looked at the ratio of >>> the goodputs for the fastest and slowest flows. These were averaged for >>> all the runs. A fairness of 1 means all flows had the same goodput, a >>> fairness of 2 means the fastest flow was twice as fast as the slowest >>> flow. >>> >>> The setup for the performance experiments was 4 or 5 servers in a rack, >>> 10G links. I tested various probabilities, but 20 seemed to have the >>> best tradeoff for my setup (small RTTs). 
>>> >>> --- node1 - >>> sender --- switch --- node2 - switch receiver >>> --- node3 - >>> >>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or >>> node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2 >>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs. >>> 1.9G with flow balancing due to 2 flows ending up in one link and either >>> not moving and taking some time to move. Fairness was 1 in all cases. >>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2 >>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is, >>> flow balancing increased fairness. >>> >>> Scenario 2: One sender to one receiver, through 3 routes (node1,... >>> node2). With 6 or 16 flows the goodput was the same for all, but >>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst >>> case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is, >>> prob=20,mode=1 improved average and worst case fairness. >>I am wondering if we can build a better API with the routing layer to >>implement this type of feature, instead of scattering the tx_rehashing >>logic in TCP. For example, we call dst_negative_advice on TCP >>write timeouts. > > Not sure. The route is not necessarily bad; it may be temporarily congested > or they may all be congested. If all we want to do is change the txhash > (unlike dst_negative_advice), then calling a tx_rehashing function may > be the appropriate call. > >> >>On the patch itself, it seems aggressive to (attempt to) rehash every >>post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to >>identify post-RTO retransmission directly. > > Thanks, I will add the test. > >> >>is this an implementation of the Flow Bender ? 
>>http://dl.acm.org/citation.cfm?id=2674985 > > Part of flow bender, although there are also some similarities to flowlet > switching. > >> >>> >>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of >>> the packets. With 7 flows, goodput was the same 1.1G, but fairness was >>> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then >>> balancing, which does more re-routes, is less fair. >>> >>> Signed-off-by: Lawrence Brakmo >>> --- >>> Documentation/networking/ip-sysctl.txt | 15 +++ >>> include/linux/tcp.h| 4 +++- >>> include/net/tcp.h | 2 ++ >>> net/ipv4/sysctl_net_ipv4.c | 18 ++ >>> net/ipv4/tcp_input.c | 10 ++ >>> net/ipv4/tcp_output.c | 23 ++- >>> net/ipv4/tcp_timer.c | 4 >>> 7 files changed, 74 insertions(+), 2 deletions(-) >>> >>> diff --git a/Documentation/networking/ip-sysc
Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
Yuchung, thank you for your comments. Responses inline. On 10/11/16, 12:49 PM, "Yuchung Cheng" wrote: >On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo wrote: >> >> The purpose of this patch is to help balance flows across paths. A new >> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that >> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit. >> A probability is used in order to control how many flows are moved >> during a congestion event and prevent the congested path from becoming >> underutilized (which could occur if too many flows leave the current >> path). Txhash changes may be delayed in order to decrease the likelihood >> that it will trigger retransmits due to too much reordering. >> >> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after >> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger >> txhash changes. The idea is to decrease the likelihood of going back >> to a broken path. That is, we don't want flow balancing to trigger >> changes to broken paths. The drawback is that flow balancing does >> not work as well. If the sysctl is greater than 0, then we always >> do flow balancing, even after RTOs. >> >> Tested with packetdrill tests (for correctness) and performance >> experiments with 2 and 3 paths. Performance experiments looked at >> aggregate goodput and fairness. For each run, we looked at the ratio of >> the goodputs for the fastest and slowest flows. These were averaged for >> all the runs. A fairness of 1 means all flows had the same goodput, a >> fairness of 2 means the fastest flow was twice as fast as the slowest >> flow. >> >> The setup for the performance experiments was 4 or 5 servers in a rack, >> 10G links. I tested various probabilities, but 20 seemed to have the >> best tradeoff for my setup (small RTTs). 
>> >> --- node1 - >> sender --- switch --- node2 - switch receiver >> --- node3 - >> >> Scenario 1: One sender sends to one receiver through 2 routes (node1 or >> node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2 >> flows, without flow balancing (prob=0) the average goodput is 1.6G vs. >> 1.9G with flow balancing due to 2 flows ending up in one link and either >> not moving and taking some time to move. Fairness was 1 in all cases. >> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2 >> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is, >> flow balancing increased fairness. >> >> Scenario 2: One sender to one receiver, through 3 routes (node1,... >> node2). With 6 or 16 flows the goodput was the same for all, but >> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst >> case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is, >> prob=20,mode=1 improved average and worst case fairness. >I am wondering if we can build a better API with the routing layer to >implement this type of feature, instead of scattering the tx_rehashing >logic in TCP. For example, we call dst_negative_advice on TCP >write timeouts. Not sure. The route is not necessarily bad; it may be temporarily congested or they may all be congested. If all we want to do is change the txhash (unlike dst_negative_advice), then calling a tx_rehashing function may be the appropriate call. > >On the patch itself, it seems aggressive to (attempt to) rehash every >post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to >identify post-RTO retransmission directly. Thanks, I will add the test. > >is this an implementation of the Flow Bender ? 
>http://dl.acm.org/citation.cfm?id=2674985 Part of flow bender, although there are also some similarities to flowlet switching. > >> >> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of >> the packets. With 7 flows, goodput was the same 1.1G, but fairness was >> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then >> balancing, which does more re-routes, is less fair. >> >> Signed-off-by: Lawrence Brakmo >> --- >> Documentation/networking/ip-sysctl.txt | 15 +++ >> include/linux/tcp.h| 4 +++- >> include/net/tcp.h | 2 ++ >> net/ipv4/sysctl_net_ipv4.c | 18 ++ >> net/ipv4/tcp_input.c | 10 ++ >> net/ipv4/tcp_output.c | 23 ++- >> net/ipv4/tcp_timer.c | 4 >> 7 files changed, 74 insertions(+), 2 deletions(-) >> >> diff --git a/Documentation/networking/ip-sysctl.txt >>b/Documentation/networking/ip-sysctl.txt >> index 3db8c67..87a984c 100644 >> --- a/Documentation/networking/ip-sysctl.txt >> +++ b/Documentation/networking
Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits
On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo wrote: > > The purpose of this patch is to help balance flows across paths. A new > sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that > the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit. > A probability is used in order to control how many flows are moved > during a congestion event and prevent the congested path from becoming > underutilized (which could occur if too many flows leave the current > path). Txhash changes may be delayed in order to decrease the likelihood > that it will trigger retransmits due to too much reordering. > > Another sysctl "tcp_retrans_txhash_mode" determines the behavior after > RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger > txhash changes. The idea is to decrease the likelihood of going back > to a broken path. That is, we don't want flow balancing to trigger > changes to broken paths. The drawback is that flow balancing does > not work as well. If the sysctl is greater than 0, then we always > do flow balancing, even after RTOs. > > Tested with packetdrill tests (for correctness) and performance > experiments with 2 and 3 paths. Performance experiments looked at > aggregate goodput and fairness. For each run, we looked at the ratio of > the goodputs for the fastest and slowest flows. These were averaged for > all the runs. A fairness of 1 means all flows had the same goodput, a > fairness of 2 means the fastest flow was twice as fast as the slowest > flow. > > The setup for the performance experiments was 4 or 5 servers in a rack, > 10G links. I tested various probabilities, but 20 seemed to have the > best tradeoff for my setup (small RTTs). > > --- node1 - > sender --- switch --- node2 - switch receiver > --- node3 - > > Scenario 1: One sender sends to one receiver through 2 routes (node1 or > node 2). The output from node1 and node2 is 1G (1gbit/sec). 
With only 2 > flows, without flow balancing (prob=0) the average goodput is 1.6G vs. > 1.9G with flow balancing due to 2 flows ending up in one link and either > not moving and taking some time to move. Fairness was 1 in all cases. > For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2 > for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is, > flow balancing increased fairness. > > Scenario 2: One sender to one receiver, through 3 routes (node1,... > node2). With 6 or 16 flows the goodput was the same for all, but > fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst > case fairness out of 10 runs were 2.2, 1.8 and 1.4 respectively. That is, > prob=20,mode=1 improved average and worst case fairness. I am wondering if we can build a better API with the routing layer to implement this type of feature, instead of scattering the tx_rehashing logic in TCP. For example, we call dst_negative_advice on TCP write timeouts. On the patch itself, it seems aggressive to (attempt to) rehash every post-RTO retransmission. Also you can just use ca_state (==CA_Loss) to identify post-RTO retransmission directly. is this an implementation of the Flow Bender ? http://dl.acm.org/citation.cfm?id=2674985 > > Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of > the packets. With 7 flows, goodput was the same 1.1G, but fairness was > 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then > balancing, which does more re-routes, is less fair. 
> > Signed-off-by: Lawrence Brakmo > --- > Documentation/networking/ip-sysctl.txt | 15 +++ > include/linux/tcp.h| 4 +++- > include/net/tcp.h | 2 ++ > net/ipv4/sysctl_net_ipv4.c | 18 ++ > net/ipv4/tcp_input.c | 10 ++ > net/ipv4/tcp_output.c | 23 ++- > net/ipv4/tcp_timer.c | 4 > 7 files changed, 74 insertions(+), 2 deletions(-) > > diff --git a/Documentation/networking/ip-sysctl.txt > b/Documentation/networking/ip-sysctl.txt > index 3db8c67..87a984c 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER > if paths are using per packet load balancing (like bonding rr mode) > Default: 300 > > +tcp_retrans_txhash_mode - INTEGER > + If zero, disable txhash recalculation due to non-RTO retransmissions > + after an RTO. The idea is that broken paths will trigger an RTO and > + we don't want to go back to that path due to standard retransmissions > + (flow balancing). The drawback is that balancing is less robust. > + If greater than zero, txhash can always (probabilistically) be > + recalculated after non-RTO retransmissions. > + > +tcp_retrans_txhash_prob - INTEGER > + Probability [0 to 100] that we will recalculate txhash when a > +