Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Sat, 30 Jun 2018, Neal Cardwell wrote: > As I mentioned, I ran your patch through all our team's TCP > packetdrill tests, and it passes all of the tests. One of our tests > needed updating, because if there is a non-SACK connection with a > spurious RTO due to a delayed flight of ACKs then the FRTO undo now > happens one ACK later (when we get an ACK that doesn't cover a > retransmit). But that seems fine to me. Yes, this is what is wanted. The non-SACK FRTO cannot make decision on the first cumulative ACK because that could be (often is) triggered by the retransmit but only from the next ACK after that. Even with SACK FRTO, there is a hazard on doing it that early as tail ACK losses can lead to discovery of newly SACKed skbs from ACK of the retransmitted segment. For that to occur, however, the cumulative ACK cannot cover those skbs implying more holes that need to be recovered. Therefore, the window reduction will eventually occur anyway but it would still first do a bogus undo also in that case. > I also cooked the new packetdrill test below to explicitly cover this > case you are addressing (please let me know if you have an alternate > suggestion). > > Tested-by: Neal Cardwell > Acked-by: Neal Cardwell > > Thanks! > neal > > --- > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 >+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 >+0 bind(3, ..., ...) = 0 >+0 listen(3, 1) = 0 > >+0 < S 0:0(0) win 32792 >+0 > S. 0:0(0) ack 1 > +.02 < . 1:1(0) ack 1 win 257 >+0 accept(3, ..., ...) = 4 > > // Send 3 packets. First is really lost. And the dupacks > // for the data packets that arrived at the reciver are slow in arriving. >+0 write(4, ..., 3000) = 3000 >+0 > P. 1:3001(3000) ack 1 > > // RTO and retransmit head. This fills a real loss. > +.22 > . 1:1001(1000) ack 1 > > // Dupacks for packets 2 and 3 arrive. > +.02 < . 1:1(0) ack 1 win 257 >+0 < . 1:1(0) ack 1 win 257 > > // The cumulative ACK for all the data arrives. We do not undo, because > // this is a non-SACK connection, and retransmitted data was ACKed. > // It's good that there's no FRTO undo, since a packet was really lost. > // Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss > // until something beyond high_seq is ACKed. > +.005 < . 1:1(0) ack 3001 win 257 >+0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }% >+0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }% >+0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }% I think that the snd_cwnd is still fishy there but that would require also the other patch from my series (cwnd was 1 so it should be 2 after the cumulative ACK). -- i.
Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Fri, 29 Jun 2018, Neal Cardwell wrote: > On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen > wrote: > > > > If SACK is not enabled and the first cumulative ACK after the RTO > > retransmission covers more than the retransmitted skb, a spurious > > FRTO undo will trigger (assuming FRTO is enabled for that RTO). > > The reason is that any non-retransmitted segment acknowledged will > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > > no indication that it would have been delivered for real (the > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > > case so the check for that bit won't help like it does with SACK). > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo > > in tcp_process_loss. > > > > We need to use more strict condition for non-SACK case and check > > that none of the cumulatively ACKed segments were retransmitted > > to prove that progress is due to original transmissions. Only then > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in > > non-SACK case. > > > > (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS > > to better indicate its purpose but to keep this change minimal, it > > will be done in another patch). > > > > Besides burstiness and congestion control violations, this problem > > can result in RTO loop: When the loss recovery is prematurely > > undoed, only new data will be transmitted (if available) and > > the next retransmission can occur only after a new RTO which in case > > of multiple losses (that are not for consecutive packets) requires > > one RTO per loss to recover. > > > > Signed-off-by: Ilpo Järvinen > > --- > > net/ipv4/tcp_input.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 045d930..8e5522c 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > > > if (tcp_is_reno(tp)) { > > tcp_remove_reno_sacks(sk, pkts_acked); > > + > > + /* If any of the cumulatively ACKed segments was > > +* retransmitted, non-SACK case cannot confirm that > > +* progress was due to original transmission due to > > +* lack of TCPCB_SACKED_ACKED bits even if some of > > +* the packets may have been never retransmitted. > > +*/ > > + if (flag & FLAG_RETRANS_DATA_ACKED) > > + flag &= ~FLAG_ORIG_SACK_ACKED; > > } else { > > int delta; > > Thanks, Ilpo. I ran this through our TCP packetdrill tests and only > got one failure, which detected the change you made, so that on the > first ACK in a non-SACK F-RTO case we stay in CA_Loss. Great, thanks for testing. > However, depending on the exact scenario we are concerned about here, > this may not be a complete solution. Consider the packetdrill test I > cooked below, which is for a scenario where we imagine a non-SACK > connection, with data packet #1 and all dupacks lost. With your patch > we don't have an F-RTO undo on the first cumulative ACK, which is a > definite improvement. However, we end up with an F-RTO undo on the > *second* cumulative ACK, because the second ACK didn't cover any > retransmitted data. That means that we end up in the second round trip > back in the initial slow-start mode with a very high cwnd and infinite > ssthresh, even though data was actually lost in the first round trip. > > I'm not sure how much we care about all these cases. Perhaps we should > just disable F-RTO if the connection has no SACK enabled? These > non-SACK connections are just such a rare case at this point that I > would propose it's not worth spending too much time on this. > > The following packetdrill script passes when Ilpo's patch is applied. > This documents the behavior, which I think is questionable: > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 >+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 >+0 bind(3, ..., ...) = 0 >+0 listen(3, 1) = 0 > >+0 < S 0:0(0) win 32792 >+0 > S. 0:0(0) ack 1 > +.02 < . 1:1(0) ack 1 win 257 >+0 accept(3, ..., ...) = 4 >+0 write(4, ..., 27000) = 27000 >+0 > . 1:10001(1) ack 1 > > // Suppose 1:1001 is lost and all dupacks are lost. > > // RTO and
[PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
If SACK is not enabled and the first cumulative ACK after the RTO retransmission covers more than the retransmitted skb, a spurious FRTO undo will trigger (assuming FRTO is enabled for that RTO). The reason is that any non-retransmitted segment acknowledged will set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is no indication that it would have been delivered for real (the scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK case so the check for that bit won't help like it does with SACK). Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo in tcp_process_loss. We need to use more strict condition for non-SACK case and check that none of the cumulatively ACKed segments were retransmitted to prove that progress is due to original transmissions. Only then keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in non-SACK case. (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS to better indicate its purpose but to keep this change minimal, it will be done in another patch). Besides burstiness and congestion control violations, this problem can result in RTO loop: When the loss recovery is prematurely undoed, only new data will be transmitted (if available) and the next retransmission can occur only after a new RTO which in case of multiple losses (that are not for consecutive packets) requires one RTO per loss to recover. Signed-off-by: Ilpo Järvinen --- net/ipv4/tcp_input.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 045d930..8e5522c 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, if (tcp_is_reno(tp)) { tcp_remove_reno_sacks(sk, pkts_acked); + + /* If any of the cumulatively ACKed segments was +* retransmitted, non-SACK case cannot confirm that +* progress was due to original transmission due to +* lack of TCPCB_SACKED_ACKED bits even if some of +* the packets may have been never retransmitted. +*/ + if (flag & FLAG_RETRANS_DATA_ACKED) + flag &= ~FLAG_ORIG_SACK_ACKED; } else { int delta; -- 2.7.4
Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
On Wed, 28 Mar 2018, Yuchung Cheng wrote: > On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ych...@google.com> wrote: > > > > On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen > > <ilpo.jarvi...@helsinki.fi> wrote: > > > On Tue, 27 Mar 2018, Yuchung Cheng wrote: > > > > > >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen > > >> <ilpo.jarvi...@helsinki.fi> wrote: > > >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote: > > >> > > > >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > > >> >> <ilpo.jarvi...@helsinki.fi> wrote: > > >> >> > > > >> >> > A miscalculation for the number of acknowledged packets occurs > > >> >> > during > > >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK > > >> >> > covers > > >> >> > any non-retransmitted skbs. The reason is that pkts_acked value > > >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start > > >> >> > after > > >> >> > RTO as it may include segments that were not lost and therefore did > > >> >> > not need retransmissions in the slow start following the RTO. Then > > >> >> > tcp_slow_start will add the excess into cwnd bloating it and > > >> >> > triggering a burst. > > >> >> > > > >> >> > Instead, we want to pass only the number of retransmitted segments > > >> >> > that were covered by the cumulative ACK (and potentially newly sent > > >> >> > data segments too if the cumulative ACK covers that far). > > >> >> > > > >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > > >> >> > --- > > >> >> > > >> >> My understanding is there are two problems > > >> >> > > >> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires > > >> >> precise cumulatively acked count, not newly acked count? > > >> > > > >> > While I'm not entirely sure if you intented to say that my fix is > > >> > broken > > >> > or not, I thought this very difference alot while making the fix and I > > >> > believe that this fix is needed because of the discontinuity at RTO > > >> > (sacked_out is cleared as we set L-bits + lost_out). This is an > > >> > artifact > > >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep > > >> > that > > >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, > > >> > we > > >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks > > >> > logic. > > >> > > > >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate > > >> > tp->delivered, using plain cumulative acked causes congestion control > > >> > breakage later as call to tcp_cong_control will directly use the > > >> > difference in tp->delivered. > > >> > > > >> > This boils down the exact definition of tp->delivered (the one given in > > >> > the header is not detailed enough). I guess you might have better idea > > >> > what it exactly is since one of you has added it? There are subtle > > >> > things > > >> > in the defination that can make it entirely unsuitable for cc > > >> > decisions. > > >> > Should those segments that we (possibly) already counted into > > >> > tp->delivered during (potentially preceeding) CA_Recovery be added to > > >> > it > > >> > for _second time_ or not? This fix avoids such double counting (the > > >> Where is the double counting, assuming normal DUPACK behavior? > > >> > > >> In the non-sack case: > > >> > > >> 1. upon receiving a DUPACK, we assume one packet has been delivered by > > >> incrementing tp->delivered in tcp_add_reno_sack() > > > > > > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity > > > I've tried to point out quite many times already)... > > > > > >> 2. upon receiving a partial ACK or an ACK that acks recovery point > > >> (high_seq), tp->delivered is incremented by (cumulati
Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
On Wed, 28 Mar 2018, Yuchung Cheng wrote: > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > <ilpo.jarvi...@helsinki.fi> wrote: > > > > If SACK is not enabled and the first cumulative ACK after the RTO > > retransmission covers more than the retransmitted skb, a spurious > > FRTO undo will trigger (assuming FRTO is enabled for that RTO). > > The reason is that any non-retransmitted segment acknowledged will > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > > no indication that it would have been delivered for real (the > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > > case so the check for that bit won't help like it does with SACK). > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo > > in tcp_process_loss. > > > > We need to use more strict condition for non-SACK case and check > > that none of the cumulatively ACKed segments were retransmitted > > to prove that progress is due to original transmissions. Only then > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in > > non-SACK case. > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > > --- > > net/ipv4/tcp_input.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 4a26c09..c60745c 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > pkts_acked = rexmit_acked + newdata_acked; > > > > tcp_remove_reno_sacks(sk, pkts_acked); > > + > > + /* If any of the cumulatively ACKed segments was > > +* retransmitted, non-SACK case cannot confirm that > > +* progress was due to original transmission due to > > +* lack of TCPCB_SACKED_ACKED bits even if some of > > +* the packets may have been never retransmitted. > > +*/ > > + if (flag & FLAG_RETRANS_DATA_ACKED) > > + flag &= ~FLAG_ORIG_SACK_ACKED; > > How about keeping your excellent comment but move the fix to F-RTO > code directly so it's more clear? this way the flag remains clear that > indicates some never-retransmitted data are acked/sacked. > > // pseudo code for illustration > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 8d480542aa07..f7f3357de618 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -2629,8 +2629,15 @@ static void tcp_process_loss(struct sock *sk, > int flag, bool is_dupack, > if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ > /* Step 3.b. A timeout is spurious if not all data are > * lost, i.e., never-retransmitted data are (s)acked. > +* > +* If any of the cumulatively ACKed segments was > +* retransmitted, non-SACK case cannot confirm that > +* progress was due to original transmission due to > +* lack of TCPCB_SACKED_ACKED bits even if some of > +* the packets may have been never retransmitted. > */ > if ((flag & FLAG_ORIG_SACK_ACKED) && > + (tcp_is_sack(tp) || !FLAG_RETRANS_DATA_ACKED) && > tcp_try_undo_loss(sk, true)) > return; Of course I could put the back there but I really like the new place more (which was a result of your suggestion to place the code elsewhere). IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we weren't successful in proving (there in tcp_clean_rtx_queue) that progress was due original transmission and thus I would not want falsely indicate it with that flag. And there's the non-SACK related block anyway already there so it keeps the non-SACK "pollution" off from the SACK code paths. (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition we're after regardless of SACK and less ambiguous in non-SACK case). -- i.
Re: [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
On Wed, 28 Mar 2018, Yuchung Cheng wrote: > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > <ilpo.jarvi...@helsinki.fi> wrote: > > When a cumulative ACK lands to high_seq at the end of loss > > recovery and SACK is not enabled, the sender needs to avoid > > false fast retransmits (RFC6582). The avoidance mechanisms is > > implemented by remaining in the loss recovery CA state until > > one additional cumulative ACK arrives. During the operation of > > this avoidance mechanism, there is internal transient in the > > use of state variables which will always trigger a bogus undo. > > Do we have to make undo in non-sack perfect? can we consider a much > simpler but imperfect fix of > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 8d480542aa07..95225d9de0af 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -2356,6 +2356,7 @@ static bool tcp_try_undo_recovery(struct sock *sk) > * fast retransmits (RFC2582). SACK TCP is safe. */ > if (!tcp_any_retrans_done(sk)) > tp->retrans_stamp = 0; > + tp->undo_marker = 0; > return true; > } Yes, that's of course a possible and would workaround the issue too. In fact, I initially did that kind of fix for myself (I put it into a block with tp->retrans_stamp = 0 though). But then I realized that it is not that complicated to make the fix locally into tcp_packet_delayed() (except the annoyance of passing all the necessary state parameters through the deep static call-chain but that should pose no big challenge for the compiler to handle I guess). BTW, do you know under what circumstances that tcp_any_retrans_done(sk) would return non-zero here (snd_una == high_seq so those rexmit would need to be above high_seq)? -- i.
Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
On Tue, 27 Mar 2018, Yuchung Cheng wrote: > On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen > <ilpo.jarvi...@helsinki.fi> wrote: > > On Mon, 26 Mar 2018, Yuchung Cheng wrote: > > > >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > >> <ilpo.jarvi...@helsinki.fi> wrote: > >> > > >> > A miscalculation for the number of acknowledged packets occurs during > >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers > >> > any non-retransmitted skbs. The reason is that pkts_acked value > >> > calculated in tcp_clean_rtx_queue is not correct for slow start after > >> > RTO as it may include segments that were not lost and therefore did > >> > not need retransmissions in the slow start following the RTO. Then > >> > tcp_slow_start will add the excess into cwnd bloating it and > >> > triggering a burst. > >> > > >> > Instead, we want to pass only the number of retransmitted segments > >> > that were covered by the cumulative ACK (and potentially newly sent > >> > data segments too if the cumulative ACK covers that far). > >> > > >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > >> > --- > >> > >> My understanding is there are two problems > >> > >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires > >> precise cumulatively acked count, not newly acked count? > > > > While I'm not entirely sure if you intented to say that my fix is broken > > or not, I thought this very difference alot while making the fix and I > > believe that this fix is needed because of the discontinuity at RTO > > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact > > in the imitation of sacked_out for non-SACK but at RTO we can't keep that > > in sync because we set L-bits (and have no S-bits to guide us). Thus, we > > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic. > > > > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate > > tp->delivered, using plain cumulative acked causes congestion control > > breakage later as call to tcp_cong_control will directly use the > > difference in tp->delivered. > > > > This boils down the exact definition of tp->delivered (the one given in > > the header is not detailed enough). I guess you might have better idea > > what it exactly is since one of you has added it? There are subtle things > > in the defination that can make it entirely unsuitable for cc decisions. > > Should those segments that we (possibly) already counted into > > tp->delivered during (potentially preceeding) CA_Recovery be added to it > > for _second time_ or not? This fix avoids such double counting (the > Where is the double counting, assuming normal DUPACK behavior? > > In the non-sack case: > > 1. upon receiving a DUPACK, we assume one packet has been delivered by > incrementing tp->delivered in tcp_add_reno_sack() 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity I've tried to point out quite many times already)... > 2. upon receiving a partial ACK or an ACK that acks recovery point > (high_seq), tp->delivered is incremented by (cumulatively acked - > #dupacks) in tcp_remove_reno_sacks() ...and this won't happen correctly anymore after RTO (since non-SACK won't keep #dupacks due to the discontinuity). Thus we end up adding cumulatively acked - 0 to tp->delivered on those ACKs. > therefore tp->delivered is tracking the # of packets delivered > (sacked, acked, DUPACK'd) with the most information it could have > inferred. Since you didn't answer any of my questions about tp->delivered directly, let me rephrase them to this example (non-SACK, of course): 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0). Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for three segments. How much should tp->delivered be incremented? 2 or 3? ...I think 2 is the right answer. > From congestion control's perspective, it cares about the delivery > information (e.g. how much), not the sequences (what or how). I guess you must have missed my point. I'm talking about "how much" whole the time. It's just about when can we account for it (and when not). > I am pointing out that > > 1) your fix may fix one corner CC packet accounting issue in the > non-SACK and CA_Loss > 2) but it does not fix the other (major) CC packet accounting issue in > the SACK case You say major but it's major if and only if one is using one of the affected cc modules wh
Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
On Mon, 26 Mar 2018, Yuchung Cheng wrote: > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > <ilpo.jarvi...@helsinki.fi> wrote: > > > > A miscalculation for the number of acknowledged packets occurs during > > RTO recovery whenever SACK is not enabled and a cumulative ACK covers > > any non-retransmitted skbs. The reason is that pkts_acked value > > calculated in tcp_clean_rtx_queue is not correct for slow start after > > RTO as it may include segments that were not lost and therefore did > > not need retransmissions in the slow start following the RTO. Then > > tcp_slow_start will add the excess into cwnd bloating it and > > triggering a burst. > > > > Instead, we want to pass only the number of retransmitted segments > > that were covered by the cumulative ACK (and potentially newly sent > > data segments too if the cumulative ACK covers that far). > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > > --- > > net/ipv4/tcp_input.c | 16 +++- > > 1 file changed, 15 insertions(+), 1 deletion(-) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 9a1b3c1..4a26c09 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > long seq_rtt_us = -1L; > > long ca_rtt_us = -1L; > > u32 pkts_acked = 0; > > + u32 rexmit_acked = 0; > > + u32 newdata_acked = 0; > > u32 last_in_flight = 0; > > bool rtt_update; > > int flag = 0; > > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > } > > > > if (unlikely(sacked & TCPCB_RETRANS)) { > > - if (sacked & TCPCB_SACKED_RETRANS) > > + if (sacked & TCPCB_SACKED_RETRANS) { > > tp->retrans_out -= acked_pcount; > > + rexmit_acked += acked_pcount; > > + } > > flag |= FLAG_RETRANS_DATA_ACKED; > > } else if (!(sacked & TCPCB_SACKED_ACKED)) { > > last_ackt = skb->skb_mstamp; > > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > reord = start_seq; > > if (!after(scb->end_seq, tp->high_seq)) > > flag |= FLAG_ORIG_SACK_ACKED; > > + else > > + newdata_acked += acked_pcount; > > } > > > > if (sacked & TCPCB_SACKED_ACKED) { > > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > } > > > > if (tcp_is_reno(tp)) { > > + /* Due to discontinuity on RTO in the artificial > > +* sacked_out calculations, TCP must restrict > > +* pkts_acked without SACK to rexmits and new data > > +* segments > > +*/ > > + if (icsk->icsk_ca_state == TCP_CA_Loss) > > + pkts_acked = rexmit_acked + newdata_acked; > > + > My understanding is there are two problems > > 1) your fix: the reordering logic in tcp-remove_reno_sacks requires > precise cumulatively acked count, not newly acked count? While I'm not entirely sure if you intented to say that my fix is broken or not, I thought this very difference alot while making the fix and I believe that this fix is needed because of the discontinuity at RTO (sacked_out is cleared as we set L-bits + lost_out). This is an artifact in the imitation of sacked_out for non-SACK but at RTO we can't keep that in sync because we set L-bits (and have no S-bits to guide us). Thus, we cannot anymore "use" those skbs with only L-bit for the reno_sacks logic. In tcp_remove_reno_sacks acked - sacked_out is being used to calculate tp->delivered, using plain cumulative acked causes congestion control breakage later as call to tcp_cong_control will directly use the difference in tp->delivered. This boils down the exact definition of tp->delivered (the one given in the header is not detailed enough). I guess you might have better idea what it exactly is since one of you has added it? There are subtle things in the defination that can make it entirely u
Re: [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
On Tue, 13 Mar 2018, Eric Dumazet wrote: > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > <ilpo.jarvi...@helsinki.fi> wrote: > > Here is a series of fixes to issues that occur when SACK is not > > enabled for a TCP connection. These are not fixes to just some > > remote corner cases of recovery but many/almost all recoveries > > without SACK will be impacted by one (or even more than one) of > > them. The sender-side changes (1-4) are not mainly, if any, to > > improve performance (throughput) but address correctness > > (congestion control responses should not get incorrectly > > reverted) and burstiness (that may cause additional problems > > later as some of the packets in such bursts may get dropped > > needing again to resort to loss recovery that is likely > > similarly bursty). > > GRO (or similar) adoption a long time ago (not only by linux) had a > serious impact on non SACK flow. > Should we also disable GRO by default ? > (my answer is no, just in case someone wonders) > > I am sorry, but I am still not convinced by your patches, trying to > give a wrong incentive to people keeping their prehistoric stacks, > that have serious problems on wifi anyway, and or middle boxes > "aggregating/compressing ACKS" So now you think that I'm against high perf enhancements (even after having written some bits of them)? However, I think that in case somebody disables something that is enabled by default, be it SACK, GRO, timestamps, etc., TCP he/she will get should still make sense (and that regardless of some middleboxes trying hard to break otherwise working stuff). But I guess you don't agree here(?) and would perhaps even want to remove ability to disable them? Although obviously that wouldn't even work with non-SACK (unless RST or so starts to get used but even that could be worked-around unfortunately). Also, I'm somewhat confused your position here. On one hand you said that tests should be added to regression tests to avoid breaking these non-SACK cases again implying that things should be fixed and on the other hand you seem to say that non-SACK must be left broken? > The last patch is particularly problematic in my opinion, you have not > provided packetdrill tests to prove there was no danger. Can that even be done? Proving absence of danger with packetdrill? AFAIK that kind of proofs are available only with formal verification. > It seems you leave to us the task of dealing with possible issues, > only added a bold changelog. Heh, I tried to add details to the changelog because you explicitly said I should and now you fault me on doing that :-). Also, you're leaving out the detail that I tried to understand the condition that worried you and found out that the code already disallows any shrinking of advertized window for duplicate ACKs (but I guess there just might have been some miscommunication between us given that your last reply to 5/5 v1 made no sense to me). > Since the bugs you claim to fix are at least 10 years old, and your > fixes are far from being trivial, > please wait our packetdrill regression tests being published. > This should happen in less than one month I believe, in time for linux-4.17 > > Note that the publication of the updated packetdrill and test suite is > not an easy task, > since we accumulated a lot of hacks both in kernel to cope with timer > slacks and in packetdrill > for various experimental or private features that are not yet in > upstream kernels. > > So we are cleaning all this, and will be happy to help you if you need. I slightly disagree on the trivial bit here as I think it's trivial to see the changes only affect non-SACK flows (of which you seem to say you don't care if they're broken as that gives incentive to use SACK). But then I'm not too worried about waiting a month. -- i.
[PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
Currently, the TCP code is overly eager to increase window on almost every ACK. It makes those ACKs that the receiver should sent as dupACKs look like they update window that is not considered a real dupACK by the non-SACK sender-side code. Therefore the sender needs to resort to RTO to recover losses as fast retransmit/fast recovery cannot be triggered by such masked duplicate ACKs. This change makes sure that when an ofo segment is received, no change to window is applied if we are going to send a dupACK. Even with this change, the window updates keep being maintained but that occurs "behind the scenes". That is, this change does not interfere with memory management of the flow which could have long-term impact for the progress of the flow but only prevents those updates being seen on the wire on short-term. It's ok to change the window for non-dupACKs such as the first ACK after ofo arrivals start if that ACK was using delayed ACKs and also whenever the ack field advances. As ack field typically advances once per RTT as the first hole is retransmitted, the window updates are not delayed entirely during long recoveries. Even before this fix, tcp_select_window did not allow ACK shrinking the window for duplicate ACKs (that was previously even called "treason" but the old charmy message is gone now). The advertized window can only shrink when also ack field changes which will not be blocked by this change as it is not duplicate ACK. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- include/linux/tcp.h | 3 ++- net/ipv4/tcp_input.c | 5 - net/ipv4/tcp_output.c | 43 +-- 3 files changed, 31 insertions(+), 20 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 8f4c549..e239662 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -225,7 +225,8 @@ struct tcp_sock { fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */ fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */ is_sack_reneg:1,/* in recovery from loss with SACK reneg? */ - unused:2; + dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */ + unused:1; u8 nonagle : 4,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ unused1 : 1, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 270aa48..4ff192b 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4626,6 +4626,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size) static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); bool fragstolen; int eaten; @@ -4669,7 +4670,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) * gap in queue is filled. */ if (RB_EMPTY_ROOT(>out_of_order_queue)) - inet_csk(sk)->icsk_ack.pingpong = 0; + icsk->icsk_ack.pingpong = 0; } if (tp->rx_opt.num_sacks) @@ -4719,6 +4720,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) goto queue_and_out; } + if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER)) + tp->dupack_wnd_lock = 1; tcp_data_queue_ofo(sk, skb); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6818042..45fbe92 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); u32 old_win = tp->rcv_wnd; - u32 cur_win = tcp_receive_window(tp); - u32 new_win = __tcp_select_window(sk); - - /* Never shrink the offered window */ - if (new_win < cur_win) { - /* Danger Will Robinson! -* Don't update rcv_wup/rcv_wnd here or else -* we will not be able to advertise a zero -* window in time. --DaveM -* -* Relax Will Robinson. -*/ - if (new_win == 0) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPWANTZEROWINDOWADV); - new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); + u32 cur_win; + u32 new_win; + + if (tp->dupack_wnd_lock) { + new_win = old_win; + tp->dupack_wnd_lock = 0; + } else { + cur_win = tcp_receive_window(tp); + new_win = __tcp_select_window(sk); + /* Never shrink the offered window */ +
[PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()
No functional changes. This change simplifies the next change slightly. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c60745c..72ecfbb 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit) } } +/* False fast retransmits may occur when SACK is not in use under certain + * conditions (RFC6582). The sender MUST hold old state until something + * *above* high_seq is ACKed to prevent triggering such false fast + * retransmits. SACK TCP is safe. + */ +static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp, + const u32 snd_una) +{ + return tcp_is_reno(tp) && (snd_una == tp->high_seq); +} + static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when) { return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && @@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk) } else if (tp->rack.reo_wnd_persist) { tp->rack.reo_wnd_persist--; } - if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) { - /* Hold old state until something *above* high_seq -* is ACKed. For Reno it is MUST to prevent false -* fast retransmits (RFC2582). SACK TCP is safe. */ + if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) { + /* Hold old state until something *above* high_seq is ACKed +* if false fast retransmit is possible. +*/ if (!tcp_any_retrans_done(sk)) tp->retrans_stamp = 0; return true; -- 2.7.4
[PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
A miscalculation for the number of acknowledged packets occurs during RTO recovery whenever SACK is not enabled and a cumulative ACK covers any non-retransmitted skbs. The reason is that pkts_acked value calculated in tcp_clean_rtx_queue is not correct for slow start after RTO as it may include segments that were not lost and therefore did not need retransmissions in the slow start following the RTO. Then tcp_slow_start will add the excess into cwnd bloating it and triggering a burst. Instead, we want to pass only the number of retransmitted segments that were covered by the cumulative ACK (and potentially newly sent data segments too if the cumulative ACK covers that far). Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9a1b3c1..4a26c09 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, long seq_rtt_us = -1L; long ca_rtt_us = -1L; u32 pkts_acked = 0; + u32 rexmit_acked = 0; + u32 newdata_acked = 0; u32 last_in_flight = 0; bool rtt_update; int flag = 0; @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, } if (unlikely(sacked & TCPCB_RETRANS)) { - if (sacked & TCPCB_SACKED_RETRANS) + if (sacked & TCPCB_SACKED_RETRANS) { tp->retrans_out -= acked_pcount; + rexmit_acked += acked_pcount; + } flag |= FLAG_RETRANS_DATA_ACKED; } else if (!(sacked & TCPCB_SACKED_ACKED)) { last_ackt = skb->skb_mstamp; @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, reord = start_seq; if (!after(scb->end_seq, tp->high_seq)) flag |= FLAG_ORIG_SACK_ACKED; + else + newdata_acked += acked_pcount; } if (sacked & TCPCB_SACKED_ACKED) { @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, } if (tcp_is_reno(tp)) { + /* Due to discontinuity on RTO in the artificial +* sacked_out calculations, TCP must restrict +* pkts_acked without SACK to rexmits and new data +* segments +*/ + if (icsk->icsk_ca_state == TCP_CA_Loss) + pkts_acked = rexmit_acked + newdata_acked; + tcp_remove_reno_sacks(sk, pkts_acked); } else { int delta; -- 2.7.4
[PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
Here is a series of fixes to issues that occur when SACK is not enabled for a TCP connection. These are not fixes to just some remote corner cases of recovery but many/almost all recoveries without SACK will be impacted by one (or even more than one) of them. The sender-side changes (1-4) are not mainly, if any, to improve performance (throughput) but address correctness (congestion control responses should not get incorrectly reverted) and burstiness (that may cause additional problems later as some of the packets in such bursts may get dropped needing again to resort to loss recovery that is likely similarly bursty). v1 -> v2: - Tried to improve changelogs - Reworked FRTO undo fix location - Removed extra parenthesis from EXPR (and while at it, reverse also the sides of &&) - Pass prior_snd_una rather than flag around to avoid moving tcp_packet_delayed call - Pass tp instead of sk. Sk was there only due to a subsequent change (that I think is only net-next material) limiting the use of the transient state to only RTO recoveries as it won't be needed after NewReno recovery that won't do unnecessary rexmits like the non-SACK RTO recovery does v2 -> v3: - Remove unnecessarily added braces tcp: feed correct number of pkts acked to cc tcp: prevent bogus FRTO undos with non-SACK flows tcp: move false FR condition into tcp: prevent bogus undos when SACK is not enabled tcp: send real dupACKs by locking advertized
[PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
When a cumulative ACK lands to high_seq at the end of loss recovery and SACK is not enabled, the sender needs to avoid false fast retransmits (RFC6582). The avoidance mechanisms is implemented by remaining in the loss recovery CA state until one additional cumulative ACK arrives. During the operation of this avoidance mechanism, there is internal transient in the use of state variables which will always trigger a bogus undo. When we enter to this transient state in tcp_try_undo_recovery, tcp_any_retrans_done is often (always?) false resulting in clearing retrans_stamp. On the next cumulative ACK, tcp_try_undo_recovery again executes because CA state still remains in the same recovery state and tcp_may_undo will always return true because tcp_packet_delayed has this condition: return !tp->retrans_stamp || ... Check if the false fast retransmit transient avoidance is in progress in tcp_packet_delayed to avoid bogus undos. Since snd_una has advanced already on this ACK but CA state still remains unchanged (CA state is updated slightly later than undo is checked), prior_snd_una needs to be passed to tcp_packet_delayed (instead of tp->snd_una). Passing prior_snd_una around to the tcp_packet_delayed makes this change look more involved than it really is. The additional checks done in this change only affect non-SACK case, the SACK case remains the same. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 42 ++ 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 72ecfbb..270aa48 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp, /* Nothing was retransmitted or returned timestamp is less * than timestamp of the first retransmission. */ -static inline bool tcp_packet_delayed(const struct tcp_sock *tp) +static inline bool tcp_packet_delayed(const struct tcp_sock *tp, + const u32 prior_snd_una) { - return !tp->retrans_stamp || - tcp_tsopt_ecr_before(tp, tp->retrans_stamp); + if (!tp->retrans_stamp) { + /* Sender will be in a transient state with cleared +* retrans_stamp during false fast retransmit prevention +* mechanism +*/ + return !tcp_false_fast_retrans_possible(tp, prior_snd_una); + } + return tcp_tsopt_ecr_before(tp, tp->retrans_stamp); } /* Undo procedures. */ @@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss) tp->rack.advanced = 1; /* Force RACK to re-exam losses */ } -static inline bool tcp_may_undo(const struct tcp_sock *tp) +static inline bool tcp_may_undo(const struct tcp_sock *tp, + const u32 prior_snd_una) { - return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp)); + return tp->undo_marker && + (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una)); } /* People celebrate: "We love our President!" */ -static bool tcp_try_undo_recovery(struct sock *sk) +static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_may_undo(tp)) { + if (tcp_may_undo(tp, prior_snd_una)) { int mib_idx; /* Happy end! We did not retransmit anything @@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk) } /* Undo during loss recovery after partial ACK or using F-RTO. */ -static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo) +static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una, + bool frto_undo) { struct tcp_sock *tp = tcp_sk(sk); - if (frto_undo || tcp_may_undo(tp)) { + if (frto_undo || tcp_may_undo(tp, prior_snd_una)) { tcp_undo_cwnd_reduction(sk, true); DBGUNDO(sk, "partial loss"); @@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack) * recovered or spurious. Otherwise retransmits more on partial ACKs. */ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, -int *rexmit) +int *rexmit, const u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); bool recovered = !before(tp->snd_una, tp->high_seq); if ((flag & FLAG_SND_UNA_ADVANCED) && - tcp_try_undo_loss(sk, false)) + tcp_try_undo_loss(sk, prior_snd_una, false)) return; if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ @@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk,
[PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
If SACK is not enabled and the first cumulative ACK after the RTO retransmission covers more than the retransmitted skb, a spurious FRTO undo will trigger (assuming FRTO is enabled for that RTO). The reason is that any non-retransmitted segment acknowledged will set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is no indication that it would have been delivered for real (the scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK case so the check for that bit won't help like it does with SACK). Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo in tcp_process_loss. We need to use more strict condition for non-SACK case and check that none of the cumulatively ACKed segments were retransmitted to prove that progress is due to original transmissions. Only then keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in non-SACK case. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 4a26c09..c60745c 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, pkts_acked = rexmit_acked + newdata_acked; tcp_remove_reno_sacks(sk, pkts_acked); + + /* If any of the cumulatively ACKed segments was +* retransmitted, non-SACK case cannot confirm that +* progress was due to original transmission due to +* lack of TCPCB_SACKED_ACKED bits even if some of +* the packets may have been never retransmitted. +*/ + if (flag & FLAG_RETRANS_DATA_ACKED) + flag &= ~FLAG_ORIG_SACK_ACKED; } else { int delta; -- 2.7.4
Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled
On Fri, 9 Mar 2018, David Miller wrote: > From: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET) > > > Unfortunately I don't have now permission to publish the time-seq > > graph about it but I've tried to improve the changelog messages so > > that you can better understand under which conditions the problem > > occurs. > > It is indeed extremely unfortunate that you wish to justify a change > for which you cannot provide the supporting data at all. Here is the time-seqno graph about the issue: https://www.cs.helsinki.fi/u/ijjarvin/linux/nonsackbugs/recovery_undo_bug.pdf First the correct CC action (wnd reduction) occurs; then bogus undo causes bursting back to the window with which the congestion losses occurred earlier; because of the burst, some packets get lost due to congestion again. The sender is actually somewhat lucky here: If only one packet would get lost instead of three, the same process would repeat for the next recovery (as cumulative ACK to high_seq condition would reoccur). -- i.
Re: [PATCH v2 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
On Fri, 9 Mar 2018, Sergei Shtylyov wrote: > On 03/09/2018 05:10 PM, Ilpo Järvinen wrote: > > > @@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > - if (!after(scb->end_seq, tp->high_seq)) > > + if (!after(scb->end_seq, tp->high_seq)) { > > flag |= FLAG_ORIG_SACK_ACKED; > > + } else { > > + newdata_acked += acked_pcount; > > + } > >Why add {} where they're not needed? Not for a particular reason (I don't know even myself why I added them here as usually I don't add such extra braces). I'll remove them, thanks. -- i.
Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled
On Fri, 9 Mar 2018, David Miller wrote: > From: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET) > > > Unfortunately I don't have now permission to publish the time-seq > > graph about it but I've tried to improve the changelog messages so > > that you can better understand under which conditions the problem > > occurs. > > It is indeed extremely unfortunate that you wish to justify a change > for which you cannot provide the supporting data at all. Well, the permission for time-seq graps was/is still simply just in pending state and then my patience on sending v2 out ran out :-). Given that I perceived right from the start that the main purpose for that "data" was to create a packetdrill test for that testsuite (that is still in the soon-to-be-published state after all these years ;-)), I didn't find significant benefit from a time-seq graph compared with more verbose description that is now place in the changelog. In fact, I think that time-seq graph well into the flow will be less useful than the current explination in the changelog if one wants to come up a packetdrill test but I'm no packetdrill expert. -- i.
Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled
On Wed, 7 Mar 2018, Yuchung Cheng wrote: > On Wed, Mar 7, 2018 at 12:19 PM, Neal Cardwell <ncardw...@google.com> wrote: > > > > On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > > wrote: > > > A bogus undo may/will trigger when the loss recovery state is > > > kept until snd_una is above high_seq. If tcp_any_retrans_done > > > is zero, retrans_stamp is cleared in this transient state. On > > > the next ACK, tcp_try_undo_recovery again executes and > > > tcp_may_undo will always return true because tcp_packet_delayed > > > has this condition: > > > return !tp->retrans_stamp || ... > > > > > > Check for the false fast retransmit transient condition in > > > tcp_packet_delayed to avoid bogus undos. Since snd_una may have > > > advanced on this ACK but CA state still remains unchanged, > > > prior_snd_una needs to be passed instead of tp->snd_una. > > > > This one also seems like a case where it would be nice to have a > > specific packet-by-packet example, or trace, or packetdrill scenario. > > Something that we might be able to translate into a test, or at least > > to document the issue more explicitly. > > I am hesitate for further logic to make undo "perfect" on non-sack > cases b/c undo is very complicated and SACK is extremely > well-supported today. so a trace to demonstrate how severe this issue > is appreciated. This is not just some remote corner cases to which I perhaps would understand your "making undo perfect" comment. Those undos result in a burst that, at worst, triggers additional buffer overflow because the correct CC action is cancelled. Unfortunately I don't have now permission to publish the time-seq graph about it but I've tried to improve the changelog messages so that you can better understand under which conditions the problem occurs. SACK case remains the same even after this change. I did rework the logic a bit though (pass prior_snd_una rather than flag around) to make it more obvious but the change is unfortunately lengthy (no matter what I pass through the call-chain). -- i.
[PATCH v2 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
A miscalculation for the number of acknowledged packets occurs during RTO recovery whenever SACK is not enabled and a cumulative ACK covers any non-retransmitted skbs. The reason is that pkts_acked value calculated in tcp_clean_rtx_queue is not correct for slow start after RTO as it may include segments that were not lost and therefore did not need retransmissions in the slow start following the RTO. Then tcp_slow_start will add the excess into cwnd bloating it and triggering a burst. Instead, we want to pass only the number of retransmitted segments that were covered by the cumulative ACK (and potentially newly sent data segments too if the cumulative ACK covers that far). Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9a1b3c1..0305f6d 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, long seq_rtt_us = -1L; long ca_rtt_us = -1L; u32 pkts_acked = 0; + u32 rexmit_acked = 0; + u32 newdata_acked = 0; u32 last_in_flight = 0; bool rtt_update; int flag = 0; @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, } if (unlikely(sacked & TCPCB_RETRANS)) { - if (sacked & TCPCB_SACKED_RETRANS) + if (sacked & TCPCB_SACKED_RETRANS) { tp->retrans_out -= acked_pcount; + rexmit_acked += acked_pcount; + } flag |= FLAG_RETRANS_DATA_ACKED; } else if (!(sacked & TCPCB_SACKED_ACKED)) { last_ackt = skb->skb_mstamp; @@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, last_in_flight = TCP_SKB_CB(skb)->tx.in_flight; if (before(start_seq, reord)) reord = start_seq; - if (!after(scb->end_seq, tp->high_seq)) + if (!after(scb->end_seq, tp->high_seq)) { flag |= FLAG_ORIG_SACK_ACKED; + } else { + newdata_acked += acked_pcount; + } } if (sacked & TCPCB_SACKED_ACKED) { @@ -3151,6 +3158,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, } if (tcp_is_reno(tp)) { + /* Due to discontinuity on RTO in the artificial +* sacked_out calculations, TCP must restrict +* pkts_acked without SACK to rexmits and new data +* segments +*/ + if (icsk->icsk_ca_state == TCP_CA_Loss) + pkts_acked = rexmit_acked + newdata_acked; + tcp_remove_reno_sacks(sk, pkts_acked); } else { int delta; -- 2.7.4
[PATCH v2 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
Currently, the TCP code is overly eager to increase window on almost every ACK. It makes those ACKs that the receiver should sent as dupACKs look like they update window that is not considered a real dupACK by the non-SACK sender-side code. Therefore the sender needs to resort to RTO to recover losses as fast retransmit/fast recovery cannot be triggered by such masked duplicate ACKs. This change makes sure that when an ofo segment is received, no change to window is applied if we are going to send a dupACK. Even with this change, the window updates keep being maintained but that occurs "behind the scenes". That is, this change does not interfere with memory management of the flow which could have long-term impact for the progress of the flow but only prevents those updates being seen on the wire on short-term. It's ok to change the window for non-dupACKs such as the first ACK after ofo arrivals start if that ACK was using delayed ACKs and also whenever the ack field advances. As ack field typically advances once per RTT as the first hole is retransmitted, the window updates are not delayed entirely during long recoveries. Even before this fix, tcp_select_window did not allow ACK shrinking the window for duplicate ACKs (that was previously even called "treason" but the old charmy message is gone now). The advertized window can only shrink when also ack field changes which will not be blocked by this change as it is not duplicate ACK. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- include/linux/tcp.h | 3 ++- net/ipv4/tcp_input.c | 5 - net/ipv4/tcp_output.c | 43 +-- 3 files changed, 31 insertions(+), 20 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 8f4c549..e239662 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -225,7 +225,8 @@ struct tcp_sock { fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */ fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */ is_sack_reneg:1,/* in recovery from loss with SACK reneg? */ - unused:2; + dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */ + unused:1; u8 nonagle : 4,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ unused1 : 1, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e67551a..ca49b62 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4627,6 +4627,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size) static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); bool fragstolen; int eaten; @@ -4670,7 +4671,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) * gap in queue is filled. */ if (RB_EMPTY_ROOT(>out_of_order_queue)) - inet_csk(sk)->icsk_ack.pingpong = 0; + icsk->icsk_ack.pingpong = 0; } if (tp->rx_opt.num_sacks) @@ -4720,6 +4721,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) goto queue_and_out; } + if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER)) + tp->dupack_wnd_lock = 1; tcp_data_queue_ofo(sk, skb); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6818042..45fbe92 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); u32 old_win = tp->rcv_wnd; - u32 cur_win = tcp_receive_window(tp); - u32 new_win = __tcp_select_window(sk); - - /* Never shrink the offered window */ - if (new_win < cur_win) { - /* Danger Will Robinson! -* Don't update rcv_wup/rcv_wnd here or else -* we will not be able to advertise a zero -* window in time. --DaveM -* -* Relax Will Robinson. -*/ - if (new_win == 0) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPWANTZEROWINDOWADV); - new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); + u32 cur_win; + u32 new_win; + + if (tp->dupack_wnd_lock) { + new_win = old_win; + tp->dupack_wnd_lock = 0; + } else { + cur_win = tcp_receive_window(tp); + new_win = __tcp_select_window(sk); + /* Never shrink the offered window */ +
[PATCH v2 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
If SACK is not enabled and the first cumulative ACK after the RTO retransmission covers more than the retransmitted skb, a spurious FRTO undo will trigger (assuming FRTO is enabled for that RTO). The reason is that any non-retransmitted segment acknowledged will set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is no indication that it would have been delivered for real (the scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK case so the check for that bit won't help like it does with SACK). Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo in tcp_process_loss. We need to use more strict condition for non-SACK case and check that none of the cumulatively ACKed segments were retransmitted to prove that progress is due to original transmissions. Only then keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in non-SACK case. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0305f6d..1277afb 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3167,6 +3167,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, pkts_acked = rexmit_acked + newdata_acked; tcp_remove_reno_sacks(sk, pkts_acked); + + /* If any of the cumulatively ACKed segments was +* retransmitted, non-SACK case cannot confirm that +* progress was due to original transmission due to +* lack of TCPCB_SACKED_ACKED bits even if some of +* the packets may have been never retransmitted. +*/ + if (flag & FLAG_RETRANS_DATA_ACKED) + flag &= ~FLAG_ORIG_SACK_ACKED; } else { int delta; -- 2.7.4
[PATCH v2 net 0/5] tcp: fixes to non-SACK TCP
Here is a series of fixes to issues that occur when SACK is not enabled for a TCP connection. These are not fixes to just some remote corner cases of recovery but many/almost all recoveries without SACK will be impacted by one (or even more than one) of them. The bogus bursts that the sender side fixes are for may cause additional problems later as some of the packets in such bursts may get dropped needing again to resort to loss recovery that is likely similarly bursty. v1 -> v2: - Tried to improve changelogs - Reworked FRTO undo fix location - Removed extra parenthesis from EXPR (and while at it, reverse also the sides of &&) - Pass prior_snd_una rather than flag around to avoid moving tcp_packet_delayed call - Pass tp instead of sk. sk was there only due to a subsequent change (that I think is only net-next material) limiting the use of the false FR avoidance to only RTO recoveries (as it won't be needed after NewReno recovery that won't do unnecessary rexmits like the non-SACK RTO recovery does) tcp: feed correct number of pkts acked to cc tcp: prevent bogus FRTO undos with non-SACK flows tcp: move false FR condition into tcp: prevent bogus undos when SACK is not enabled tcp: send real dupACKs by locking advertized
[PATCH v2 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()
No functional changes. This change simplifies the next change slightly. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1277afb..4ccba94 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit) } } +/* False fast retransmits may occur when SACK is not in use under certain + * conditions (RFC6582). The sender MUST hold old state until something + * *above* high_seq is ACKed to prevent triggering such false fast + * retransmits. SACK TCP is safe. + */ +static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp, + const u32 snd_una) +{ + return tcp_is_reno(tp) && (snd_una == tp->high_seq); +} + static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when) { return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && @@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk) } else if (tp->rack.reo_wnd_persist) { tp->rack.reo_wnd_persist--; } - if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) { - /* Hold old state until something *above* high_seq -* is ACKed. For Reno it is MUST to prevent false -* fast retransmits (RFC2582). SACK TCP is safe. */ + if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) { + /* Hold old state until something *above* high_seq is ACKed +* if false fast retransmit is possible. +*/ if (!tcp_any_retrans_done(sk)) tp->retrans_stamp = 0; return true; -- 2.7.4
[PATCH v2 net 4/5] tcp: prevent bogus undos when SACK is not enabled
When a cumulative ACK lands to high_seq at the end of loss recovery and SACK is not enabled, the sender needs to avoid false fast retransmits (RFC6582). The avoidance mechanisms is implemented by remaining in the loss recovery CA state until one additional cumulative ACK arrives. During the operation of this avoidance mechanism, there is internal transient in the use of state variables which will always trigger a bogus undo. When we enter to this transient state in tcp_try_undo_recovery, tcp_any_retrans_done is often (always?) false resulting in clearing retrans_stamp. On the next cumulative ACK, tcp_try_undo_recovery again executes because CA state still remains in the same recovery state and tcp_may_undo will always return true because tcp_packet_delayed has this condition: return !tp->retrans_stamp || ... Check if the false fast retransmit transient avoidance is in progress in tcp_packet_delayed to avoid bogus undos. Since snd_una has advanced already on this ACK but CA state still remains unchanged (CA state is updated slightly later than undo is checked), prior_snd_una needs to be passed to tcp_packet_delayed (instead of tp->snd_una). Passing prior_snd_una around to the tcp_packet_delayed makes this change look more involved than it really is. The additional checks done in this change only affect non-SACK case, the SACK case remains the same. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 42 ++ 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 4ccba94..e67551a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp, /* Nothing was retransmitted or returned timestamp is less * than timestamp of the first retransmission. */ -static inline bool tcp_packet_delayed(const struct tcp_sock *tp) +static inline bool tcp_packet_delayed(const struct tcp_sock *tp, + const u32 prior_snd_una) { - return !tp->retrans_stamp || - tcp_tsopt_ecr_before(tp, tp->retrans_stamp); + if (!tp->retrans_stamp) { + /* Sender will be in a transient state with cleared +* retrans_stamp during false fast retransmit prevention +* mechanism +*/ + return !tcp_false_fast_retrans_possible(tp, prior_snd_una); + } + return tcp_tsopt_ecr_before(tp, tp->retrans_stamp); } /* Undo procedures. */ @@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss) tp->rack.advanced = 1; /* Force RACK to re-exam losses */ } -static inline bool tcp_may_undo(const struct tcp_sock *tp) +static inline bool tcp_may_undo(const struct tcp_sock *tp, + const u32 prior_snd_una) { - return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp)); + return tp->undo_marker && + (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una)); } /* People celebrate: "We love our President!" */ -static bool tcp_try_undo_recovery(struct sock *sk) +static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_may_undo(tp)) { + if (tcp_may_undo(tp, prior_snd_una)) { int mib_idx; /* Happy end! We did not retransmit anything @@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk) } /* Undo during loss recovery after partial ACK or using F-RTO. */ -static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo) +static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una, + bool frto_undo) { struct tcp_sock *tp = tcp_sk(sk); - if (frto_undo || tcp_may_undo(tp)) { + if (frto_undo || tcp_may_undo(tp, prior_snd_una)) { tcp_undo_cwnd_reduction(sk, true); DBGUNDO(sk, "partial loss"); @@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack) * recovered or spurious. Otherwise retransmits more on partial ACKs. */ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, -int *rexmit) +int *rexmit, const u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); bool recovered = !before(tp->snd_una, tp->high_seq); if ((flag & FLAG_SND_UNA_ADVANCED) && - tcp_try_undo_loss(sk, false)) + tcp_try_undo_loss(sk, prior_snd_una, false)) return; if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ @@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk,
Re: [PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
On Wed, 7 Mar 2018, Yuchung Cheng wrote: > On Wed, Mar 7, 2018 at 11:24 AM, Neal Cardwell <ncardw...@google.com> wrote: > > On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > > wrote: > > > > > > In a non-SACK case, any non-retransmitted segment acknowledged will > > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > > > no indication that it would have been delivered for real (the > > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > > > case). This causes bogus undos in ordinary RTO recoveries where > > > segments are lost here and there, with a few delivered segments in > > > between losses. A cumulative ACKs will cover retransmitted ones at > > > the bottom and the non-retransmitted ones following that causing > > > FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and results > > > in a spurious FRTO undo. > > > > > > We need to make the check more strict for non-SACK case and check > > > that none of the cumulatively ACKed segments were retransmitted, > > > which would be the case for the last step of FRTO algorithm as we > > > sent out only new segments previously. Only then, allow FRTO undo > > > to proceed in non-SACK case. > > > > Hi Ilpo - Do you have a packet trace or (even better) packetdrill > > script illustrating this issue? It would be nice to have a test case > > or at least concrete example of this. > > a packetdrill or even a contrived example would be good ... I've seen all but this for sure in packet traces. But I'm somewhat old-school that while looking for the burst issue I discovered this issue by reading the code only (making it more than _one_ issue). However, I think that I later on saw also this issue from the traces (as it seemed to not match to any of the other burst issues this whole series is trying to fix). But finding that dump afterwards would take really long time, I've more than enough of them from our recent tests ;-)). But anyway, that was before the recent moving for the condition into tp->frto block so it might no longer be triggerable. It clearly was triggerable beforehand without tp->frto guard (and I just forward-ported past that recent change without thinking it much). To trigger it, ever-R and !ever-R skb would need to be cumulatively ACKed when tp->frto is non-zero. Do you think that is still possible with FRTO? E.g., after some undo leaving some ever-R and then RTO resulting in FRTO procedure? > also why not just avoid setting FLAG_ORIG_SACK_ACKED on non-sack? seems > a much clean fix. I guess that would work now that the relevant FRTO condition got moved into the tp->frto block. It wouldn't have been that simple earlier as SACK wanted FLAG_ORIG_SACK_ACKED while non-SACK wants FLAG_ONLY_ORIG_ACKED (that was already available through a combination of the existing FLAGs). -- i.
Re: [PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
On Wed, 7 Mar 2018, Eric Dumazet wrote: > > Currently non-SACK senders cannot identify almost any duplicate ACKs > > because the window keeps updating for almost all ACKs. As a result, > > non-SACK senders end up doing loss recovery only with RTO. RTO > > recovery > > without SACK is quite annoying because it potentially sends > > large number of unnecessary rexmits. > > I get that, but switching from "always" to "never" sounds dangerous. > > May I suggest you refine your patch, to maybe allow win reductions, in > a logarithmic fashion maybe ? > > This way, when you send 1000 DUPACK, only few of them would actually > have to lower the window, and 99% of them would be considered as DUPACK > by these prehistoric TCP stacks. The problem I'm trying to fix is due to window increasing (of course, dupacks could also be masked because of decrease but that's something I've never seen to occur in practice) so I tried to make you happy and change my fix to do "never increase, always decrease". However, it turns out that even pre-fixed code implements "never decrease" not "always decrease" like you thought: static u16 tcp_select_window(struct sock *sk) { ... /* Never shrink the offered window */ if (new_win < cur_win) { ... new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); ...Thus, I don't see my fix making the case you're worried about any worse, AFAICT. -- i.
Re: [PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
Thanks for taking a look. On Wed, 7 Mar 2018, Eric Dumazet wrote: > On Wed, 2018-03-07 at 14:59 +0200, Ilpo Järvinen wrote: > > Currently, the TCP code is overly eager to update window on > > every ACK. It makes some of the ACKs that the receiver should > > sent as dupACKs look like they update window update that are > > not considered real dupACKs by the non-SACK sender-side code. > > > > Make sure that when an ofo segment is received, no change to > > window is applied if we are going to send a dupACK. It's ok > > to change the window for non-dupACKs (such as the first ACK > > after ofo arrivals start if that ACK was using delayed ACKs). > > This looks dangerous to me. > > We certainly want to lower the window at some point, or risk expensive > pruning and/or drops. I see you're conspiring for "treason" (if you recall those charmy times) ;-). > It is not clear by reading your changelog/patch, but it looks like some > malicious peers could hurt us. The malicious peers can certainly send out-of-window data already at will (with all of such packets being dropped regardless of that being expensive or not) so I don't see there's a big change for malicious case really. The window is only locked for what we've already promised for in an earlier ACK! ...Previously, reneging that promise (advertising smaller than the previous value) was called "treason" by us (unfortunately, the message is no longer as charmy). Even with this change, the window is free to change when the ack field is updated which for legimate senders occurs typically once per RTT. > By current standards, non TCP sack flows are not worth any potential > issues. Currently non-SACK senders cannot identify almost any duplicate ACKs because the window keeps updating for almost all ACKs. As a result, non-SACK senders end up doing loss recovery only with RTO. RTO recovery without SACK is quite annoying because it potentially sends large number of unnecessary rexmits. -- i.
[PATCH net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()
No functional changes. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1a33752..e20f9ad 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2211,6 +2211,19 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit) } } +/* False fast retransmits may occur when SACK is not in use under certain + * conditions (RFC6582). The sender MUST hold old state until something + * *above* high_seq is ACKed to prevent triggering such false fast + * retransmits. SACK TCP is safe. + */ +static bool tcp_false_fast_retrans_possible(const struct sock *sk, + const u32 snd_una) +{ + const struct tcp_sock *tp = tcp_sk(sk); + + return ((snd_una == tp->high_seq) && tcp_is_reno(tp)); +} + static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when) { return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && @@ -2350,10 +2363,10 @@ static bool tcp_try_undo_recovery(struct sock *sk) } else if (tp->rack.reo_wnd_persist) { tp->rack.reo_wnd_persist--; } - if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) { - /* Hold old state until something *above* high_seq -* is ACKed. For Reno it is MUST to prevent false -* fast retransmits (RFC2582). SACK TCP is safe. */ + if (tcp_false_fast_retrans_possible(sk, tp->snd_una)) { + /* Hold old state until something *above* high_seq is ACKed +* if false fast retransmit is possible. +*/ if (!tcp_any_retrans_done(sk)) tp->retrans_stamp = 0; return true; -- 2.7.4
[PATCH net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
The pkts_acked is clearly not the correct value for slow start after RTO as it may include segments that were not lost and therefore did not need retransmissions in the slow start following the RTO. Then tcp_slow_start will add the excess into cwnd bloating it and triggering a burst. Instead, we want to pass only the number of retransmitted segments that were covered by the cumulative ACK (and potentially newly sent data segments too if the cumulative ACK covers that far). Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9a1b3c1..0305f6d 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, long seq_rtt_us = -1L; long ca_rtt_us = -1L; u32 pkts_acked = 0; + u32 rexmit_acked = 0; + u32 newdata_acked = 0; u32 last_in_flight = 0; bool rtt_update; int flag = 0; @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, } if (unlikely(sacked & TCPCB_RETRANS)) { - if (sacked & TCPCB_SACKED_RETRANS) + if (sacked & TCPCB_SACKED_RETRANS) { tp->retrans_out -= acked_pcount; + rexmit_acked += acked_pcount; + } flag |= FLAG_RETRANS_DATA_ACKED; } else if (!(sacked & TCPCB_SACKED_ACKED)) { last_ackt = skb->skb_mstamp; @@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, last_in_flight = TCP_SKB_CB(skb)->tx.in_flight; if (before(start_seq, reord)) reord = start_seq; - if (!after(scb->end_seq, tp->high_seq)) + if (!after(scb->end_seq, tp->high_seq)) { flag |= FLAG_ORIG_SACK_ACKED; + } else { + newdata_acked += acked_pcount; + } } if (sacked & TCPCB_SACKED_ACKED) { @@ -3151,6 +3158,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, } if (tcp_is_reno(tp)) { + /* Due to discontinuity on RTO in the artificial +* sacked_out calculations, TCP must restrict +* pkts_acked without SACK to rexmits and new data +* segments +*/ + if (icsk->icsk_ca_state == TCP_CA_Loss) + pkts_acked = rexmit_acked + newdata_acked; + tcp_remove_reno_sacks(sk, pkts_acked); } else { int delta; -- 2.7.4
[PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled
A bogus undo may/will trigger when the loss recovery state is kept until snd_una is above high_seq. If tcp_any_retrans_done is zero, retrans_stamp is cleared in this transient state. On the next ACK, tcp_try_undo_recovery again executes and tcp_may_undo will always return true because tcp_packet_delayed has this condition: return !tp->retrans_stamp || ... Check for the false fast retransmit transient condition in tcp_packet_delayed to avoid bogus undos. Since snd_una may have advanced on this ACK but CA state still remains unchanged, prior_snd_una needs to be passed instead of tp->snd_una. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 50 ++ 1 file changed, 34 insertions(+), 16 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e20f9ad..b689915 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -98,6 +98,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE; #define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */ #define FLAG_NO_CHALLENGE_ACK 0x8000 /* do not call tcp_send_challenge_ack() */ #define FLAG_ACK_MAYBE_DELAYED 0x1 /* Likely a delayed ACK */ +#define FLAG_PACKET_DELAYED0x2 /* 0 rexmits or tstamps reveal delayed pkt */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) @@ -2243,10 +2244,19 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp, /* Nothing was retransmitted or returned timestamp is less * than timestamp of the first retransmission. */ -static inline bool tcp_packet_delayed(const struct tcp_sock *tp) +static inline bool tcp_packet_delayed(const struct sock *sk, + const u32 prior_snd_una) { - return !tp->retrans_stamp || - tcp_tsopt_ecr_before(tp, tp->retrans_stamp); + const struct tcp_sock *tp = tcp_sk(sk); + + if (!tp->retrans_stamp) { + /* Sender will be in a transient state with cleared +* retrans_stamp during false fast retransmit prevention +* mechanism +*/ + return !tcp_false_fast_retrans_possible(sk, prior_snd_una); + } + return tcp_tsopt_ecr_before(tp, tp->retrans_stamp); } /* Undo procedures. */ @@ -2336,17 +2346,20 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss) tp->rack.advanced = 1; /* Force RACK to re-exam losses */ } -static inline bool tcp_may_undo(const struct tcp_sock *tp) +static inline bool tcp_may_undo(const struct sock *sk, const int flag) { - return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp)); + const struct tcp_sock *tp = tcp_sk(sk); + + return tp->undo_marker && + (!tp->undo_retrans || (flag & FLAG_PACKET_DELAYED)); } /* People celebrate: "We love our President!" */ -static bool tcp_try_undo_recovery(struct sock *sk) +static bool tcp_try_undo_recovery(struct sock *sk, const int flag) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_may_undo(tp)) { + if (tcp_may_undo(sk, flag)) { int mib_idx; /* Happy end! We did not retransmit anything @@ -2393,11 +2406,11 @@ static bool tcp_try_undo_dsack(struct sock *sk) } /* Undo during loss recovery after partial ACK or using F-RTO. */ -static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo) +static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo, const int flag) { struct tcp_sock *tp = tcp_sk(sk); - if (frto_undo || tcp_may_undo(tp)) { + if (frto_undo || tcp_may_undo(sk, flag)) { tcp_undo_cwnd_reduction(sk, true); DBGUNDO(sk, "partial loss"); @@ -2636,7 +2649,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, bool recovered = !before(tp->snd_una, tp->high_seq); if ((flag & FLAG_SND_UNA_ADVANCED) && - tcp_try_undo_loss(sk, false)) + tcp_try_undo_loss(sk, false, flag)) return; if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ @@ -2649,7 +2662,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, */ if ((flag & FLAG_ORIG_SACK_ACKED) && (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) && - tcp_try_undo_loss(sk, true)) + tcp_try_undo_loss(sk, true, flag)) return; if (after(tp->snd_nxt, tp->high_seq)) { @@ -2672,7 +2685,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, if (recovered) { /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part
[PATCH net 0/5] tcp: fixes to non-SACK TCP
Here is a series of fixes to issues that occur when SACK is not enabled for a TCP connection: tcp: feed correct number of pkts acked to cc modules tcp: prevent bogus FRTO undos with non-SACK flows tcp: move false FR condition into tcp: prevent bogus undos when SACK is not enabled tcp: send real dupACKs by locking advertized window
[PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
Currently, the TCP code is overly eager to update window on every ACK. It makes some of the ACKs that the receiver should sent as dupACKs look like they update window update that are not considered real dupACKs by the non-SACK sender-side code. Make sure that when an ofo segment is received, no change to window is applied if we are going to send a dupACK. It's ok to change the window for non-dupACKs (such as the first ACK after ofo arrivals start if that ACK was using delayed ACKs). Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- include/linux/tcp.h | 3 ++- net/ipv4/tcp_input.c | 5 - net/ipv4/tcp_output.c | 43 +-- 3 files changed, 31 insertions(+), 20 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 8f4c549..e239662 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -225,7 +225,8 @@ struct tcp_sock { fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */ fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */ is_sack_reneg:1,/* in recovery from loss with SACK reneg? */ - unused:2; + dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */ + unused:1; u8 nonagle : 4,/* Disable Nagle algorithm? */ thin_lto: 1,/* Use linear timeouts for thin streams */ unused1 : 1, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index b689915..77a289f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4633,6 +4633,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size) static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + struct inet_connection_sock *icsk = inet_csk(sk); bool fragstolen; int eaten; @@ -4676,7 +4677,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) * gap in queue is filled. */ if (RB_EMPTY_ROOT(>out_of_order_queue)) - inet_csk(sk)->icsk_ack.pingpong = 0; + icsk->icsk_ack.pingpong = 0; } if (tp->rx_opt.num_sacks) @@ -4726,6 +4727,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) goto queue_and_out; } + if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER)) + tp->dupack_wnd_lock = 1; tcp_data_queue_ofo(sk, skb); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6818042..45fbe92 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); u32 old_win = tp->rcv_wnd; - u32 cur_win = tcp_receive_window(tp); - u32 new_win = __tcp_select_window(sk); - - /* Never shrink the offered window */ - if (new_win < cur_win) { - /* Danger Will Robinson! -* Don't update rcv_wup/rcv_wnd here or else -* we will not be able to advertise a zero -* window in time. --DaveM -* -* Relax Will Robinson. -*/ - if (new_win == 0) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPWANTZEROWINDOWADV); - new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); + u32 cur_win; + u32 new_win; + + if (tp->dupack_wnd_lock) { + new_win = old_win; + tp->dupack_wnd_lock = 0; + } else { + cur_win = tcp_receive_window(tp); + new_win = __tcp_select_window(sk); + /* Never shrink the offered window */ + if (new_win < cur_win) { + /* Danger Will Robinson! +* Don't update rcv_wup/rcv_wnd here or else +* we will not be able to advertise a zero +* window in time. --DaveM +* +* Relax Will Robinson. +*/ + if (new_win == 0) + NET_INC_STATS(sock_net(sk), + LINUX_MIB_TCPWANTZEROWINDOWADV); + new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); + } + tp->rcv_wnd = new_win; + tp->rcv_wup = tp->rcv_nxt; } - tp->rcv_wnd = new_win; - tp->rcv_wup = tp->rcv_nxt; /* Make sure we do not exceed the maximum possible * scaled window. -- 2.7.4
[PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
In a non-SACK case, any non-retransmitted segment acknowledged will set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is no indication that it would have been delivered for real (the scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK case). This causes bogus undos in ordinary RTO recoveries where segments are lost here and there, with a few delivered segments in between losses. A cumulative ACKs will cover retransmitted ones at the bottom and the non-retransmitted ones following that causing FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and results in a spurious FRTO undo. We need to make the check more strict for non-SACK case and check that none of the cumulatively ACKed segments were retransmitted, which would be the case for the last step of FRTO algorithm as we sent out only new segments previously. Only then, allow FRTO undo to proceed in non-SACK case. Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> --- net/ipv4/tcp_input.c | 5 + 1 file changed, 5 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0305f6d..1a33752 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2629,8 +2629,13 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ /* Step 3.b. A timeout is spurious if not all data are * lost, i.e., never-retransmitted data are (s)acked. +* +* As the non-SACK case does not keep track of TCPCB_SACKED_ACKED, +* we need to ensure that none of the cumulative ACKed segments +* was retransmitted to confirm the validity of FLAG_ORIG_SACK_ACKED. */ if ((flag & FLAG_ORIG_SACK_ACKED) && + (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) && tcp_try_undo_loss(sk, true)) return; -- 2.7.4
Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue
) { > - struct sk_buff *next = NULL; > > - if (!skb_queue_is_last(>out_of_order_queue, skb)) > - next = skb_queue_next(>out_of_order_queue, skb); > - skb = next; > + for (head = skb;;) { > + skb = tcp_skb_next(skb, NULL); > > - /* Segment is terminated when we see gap or when > - * we are at the end of all the queue. */ > + /* Range is terminated when we see a gap or when > + * we are at the queue end. > + */ > if (!skb || > after(TCP_SKB_CB(skb)->seq, end) || > before(TCP_SKB_CB(skb)->end_seq, start)) { > - tcp_collapse(sk, >out_of_order_queue, > + tcp_collapse(sk, NULL, >out_of_order_queue, >head, skb, start, end); > - head = skb; > - if (!skb) > - break; > - /* Start new segment */ > + goto new_range; > + } > + > + if (unlikely(before(TCP_SKB_CB(skb)->seq, start))) > start = TCP_SKB_CB(skb)->seq; > + if (after(TCP_SKB_CB(skb)->end_seq, end)) > end = TCP_SKB_CB(skb)->end_seq; > - } else { > - if (before(TCP_SKB_CB(skb)->seq, start)) > - start = TCP_SKB_CB(skb)->seq; > - if (after(TCP_SKB_CB(skb)->end_seq, end)) > - end = TCP_SKB_CB(skb)->end_seq; > - } > } > } I tried long to think if I could propose alternative layout which would make this function to exit from the end but couldn't come up anything sensible. As is, it's always exiting within that top if block which is somewhat unintuitive :-). Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki> -- i.
Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)
On Tue, 8 Dec 2015, Per Hurtig wrote: > This patch implements the RTO restart modification (RTOR). When data is > ACKed, and the RTO timer is restarted, the time elapsed since the last > outstanding segment was transmitted is subtracted from the calculated RTO > value. This way, the RTO timer will expire after exactly RTO seconds, and > not RTO + RTT [+ delACK] seconds. > > This patch also implements a new sysctl (tcp_timer_restart) that is used > to control the timer restart behavior. > > Signed-off-by: Per Hurtig> --- > Documentation/networking/ip-sysctl.txt | 12 > include/net/tcp.h | 6 ++ > net/ipv4/sysctl_net_ipv4.c | 10 ++ > net/ipv4/tcp_input.c | 29 + > 4 files changed, 57 insertions(+) > > diff --git a/Documentation/networking/ip-sysctl.txt > b/Documentation/networking/ip-sysctl.txt > index 2ea4c45..4094128 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER > with the current initial RTO of 1second. With this the final timeout > for an active TCP connection attempt will happen after 127seconds. > > +tcp_timer_restart - INTEGER > + Controls how the RTO and PTO timers are restarted (RTOR and TLPR). > + If set (per timer or combined) the timers are restarted with > + respect to the earliest outstanding segment, to not extend tail loss > + latency unnecessarily. > + Possible values: > + 0 disables RTOR and TLPR. > + 1 enables RTOR. > + 2 enables TLPR. > + 3 enables RTOR and TLPR. > + Default: 3 > + > tcp_timestamps - BOOLEAN > Enable timestamps as defined in RFC1323. > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index f80e74c..833efb7 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -76,6 +76,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); > /* After receiving this amount of duplicate ACKs fast retransmit starts. */ > #define TCP_FASTRETRANS_THRESH 3 > > +/* Disable RTO Restart if the number of outstanding segments is at least. */ > +#define TCP_TIMER_RTORESTART 1 > +#define TCP_TIMER_TLPRESTART 2 > +#define TCP_RTORESTART_THRESH4 Unfortunately the comment got now separated from the actual define. -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)
On Mon, 7 Dec 2015, Per Hurtig wrote: > This patch implements the RTO restart modification (RTOR). When data is > ACKed, and the RTO timer is restarted, the time elapsed since the last > outstanding segment was transmitted is subtracted from the calculated RTO > value. This way, the RTO timer will expire after exactly RTO seconds, and > not RTO + RTT [+ delACK] seconds. > > This patch also implements a new sysctl (tcp_timer_restart) that is used > to control the timer restart behavior. > > Signed-off-by: Per Hurtig> --- > Documentation/networking/ip-sysctl.txt | 12 > include/net/tcp.h | 4 > net/ipv4/sysctl_net_ipv4.c | 10 ++ > net/ipv4/tcp_input.c | 24 > 4 files changed, 50 insertions(+) > > diff --git a/Documentation/networking/ip-sysctl.txt > b/Documentation/networking/ip-sysctl.txt > index 2ea4c45..4094128 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER > with the current initial RTO of 1second. With this the final timeout > for an active TCP connection attempt will happen after 127seconds. > > +tcp_timer_restart - INTEGER > + Controls how the RTO and PTO timers are restarted (RTOR and TLPR). > + If set (per timer or combined) the timers are restarted with > + respect to the earliest outstanding segment, to not extend tail loss > + latency unnecessarily. > + Possible values: > + 0 disables RTOR and TLPR. > + 1 enables RTOR. > + 2 enables TLPR. > + 3 enables RTOR and TLPR. > + Default: 3 > + > tcp_timestamps - BOOLEAN > Enable timestamps as defined in RFC1323. > [...snip...] > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index fdd88c3..66e0425 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c [...snip...] > /* Restart timer after forward progress on connection. > * RFC2988 recommends to restart timer to now+rto. > */ > @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk) >*/ > if (delta > 0) > rto = delta; > + } else if (icsk->icsk_pending == ICSK_TIME_RETRANS && > +(sysctl_tcp_timer_restart == 1 || > + sysctl_tcp_timer_restart == 3) && Use a bit operation here instead? Also I think that this sysctl would benefit from named constants rather than use of literals (similar comment applies to the other patch too). > +(tp->packets_out + tcp_unsent_pkts(sk) < > + TCP_RTORESTART_THRESH)) { > + struct sk_buff *skb = tcp_write_queue_head(sk); > + const u32 rto_time_stamp = tcp_skb_timestamp(skb); > + s32 delta = (s32)(tcp_time_stamp - rto_time_stamp); > + > + if (delta > 0 && rto > delta) > + rto -= delta; > } > inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, > TCP_RTO_MAX); -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/
On Sat, 23 Feb 2008, Andrew Morton wrote: On Wed, 20 Feb 2008 15:47:18 +0200 Ilpo Järvinen [EMAIL PROTECTED] wrote: vmlinux.o: 62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869 ...+ these to lib/jhash.o: jhash_3words: 112 jhash2: 276 jhash: 475 select for networking code might need a more fine-grained approach. It should be possible to use a modular jhash.ko. The things which you have identified as clients of the jhash library are usually loaded as modules. But in the case where someone does (say) NFSD=y we do need jhash.o linked into vmlinux also. This is doable in Kconfig but I always forget how. Ok, even though its not that likely that one lives without e.g. net/ipv4/inet_connection_sock.c or net/netlink/af_netlink.c? But maybe some guys really know what they are doing and can come up with config that would be able to build it as module (for other than proof-of-concept uses I mean)... :-/ Adrian, Sam and Randy are the repositories of knowledge here ;) Thanks, I'll consult them in this. I've never needed to do any Kconfig stuff so far so it's no surprise I have very little clue... :-) I've one question for you Andrew, how would you like this kind of cross-subsys toucher to be merged, through you directly I suppose? -- i.
Re: [RFC PATCH 0/8]: uninline uninline
On Sat, 23 Feb 2008, Andrew Morton wrote: On Wed, 20 Feb 2008 15:47:10 +0200 Ilpo J__rvinen [EMAIL PROTECTED] wrote: -41525 2066 f, 3370 +, 44895 -, diff: -41525 IS_ERR This is a surprise. It surprised me as well, there were something like 10 bytes I just couldn't explain in IS_ERR size (kernel/uninlined.c: IS_ERR | +25). I was to look into it deeper but didn't have the .o's at hand right away, not so trivial to store results of 5000+ build results except some carefully picked textual output :-)... Hmm, I'll add logging for the disassembly of the uninlined stuff into the next run, that won't cost too much... I expect that the -mm-only profile-likely-unlikely-macros.patch is the cause of this and mainline doesn't have this problem. Ahaa, this explain it, I suspected that there was something (un)likely related elsewhere as well, no wonder why I couldn't reproduce the 25 bytes result in my quick copy-pasted non-kernel test... If true, then this likely/unlikely bloat has probably spread into a lot of your other results and it all should be redone against mainline, sorry :( It isn't that troublesome to redo them, it's mainly automatic combined with impatient waiting from my behalf :-)... The spreading problem is probably true, to some extent. I did some runs also with carefully selected CONFIG.*DEBUG.* off under include/net/ earlier, in general it made very little difference, if something bloats, it usually does that regardless of .config. There are probably couple of exceptions when the size is on the boundary where it's very close of being useful to uninline it. One interesting thing in there was that the largest offenders are quite small per call-site but due to vast number of them even a small benefit buys off a lot in kernel wide results. I suspect the differences due to (un)likely will be negligle, because the IS_ERR with all its trivialness is now mostly -15, so anything clearly larger than that will likely still be a win x n (where n is quite large). Anyway, I'll see when I get the new set of tests running... :-) I'd prefer with CONFIG.*DEBUG off to get larger picture about non-debug builds too (especially the scatterlist.h things interest me a lot), do you think that would do as well? Thanks for your comments, I found them very useful. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/8]: uninline uninline
On Sat, 23 Feb 2008, Andi Kleen wrote: Andrew Morton [EMAIL PROTECTED] writes: -41525 2066 f, 3370 +, 44895 -, diff: -41525 IS_ERR This is a surprise. I expect that the -mm-only profile-likely-unlikely-macros.patch is the cause of this and mainline doesn't have this problem. Shouldn't they only have overhead when the respective CONFIG is enabled? I uninlined those with allyesconfig. I'll anyway try later on without a number of debug related CONFIGs. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/8]: uninline uninline
Hi all, I run some lengthy tests to measure cost of inlines in headers under include/, simple coverage calculations yields to 89% but most of the failed compiles are due to preprocessor cutting the tested block away anyway. Test setup: v2.6.24-mm1, make allyesconfig, 32-bit x86, gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Because one inline was tested (function uninlined) at a time, the actual benefits of removing multiple inlines may well be below what the sum of those individually is (especially when something calls __-func with equal name). Ok, here's the top of the list (1+ bytes): -110805 869 f, 198 +, 111003 -, diff: -110805 skb_put -41525 2066 f, 3370 +, 44895 -, diff: -41525 IS_ERR -36290 42 f, 197 +, 36487 -, diff: -36290 cfi_build_cmd -35698 1234 f, 2391 +, 38089 -, diff: -35698 atomic_dec_and_test -28162 354 f, 3005 +, 31167 -, diff: -28162 skb_pull -23668 392 f, 104 +, 23772 -, diff: -23668 dev_alloc_skb -22212 415 f, 130 +, 22342 -, diff: -22212 __dev_alloc_skb -21593 356 f, 2418 +, 24011 -, diff: -21593 skb_push -19036 478 f, 259 +, 19295 -, diff: -19036 netif_wake_queue -18409 396 f, 6447 +, 24856 -, diff: -18409 __skb_pull -16420 187 f, 103 +, 16523 -, diff: -16420 dst_release -16025 13 f, 280 +, 16305 -, diff: -16025 cfi_send_gen_cmd -14925 486 f, 978 +, 15903 -, diff: -14925 add_timer -14896 199 f, 558 +, 15454 -, diff: -14896 sg_page -12870 36 f, 121 +, 12991 -, diff: -12870 le_key_k_type -12310 692 f, 7215 +, 19525 -, diff: -12310 signal_pending -11640 251 f, 118 +, 11758 -, diff: -11640 __skb_trim -11059 111 f, 293 +, 11352 -, diff: -11059 __nlmsg_put -10976 209 f, 123 +, 11099 -, diff: -10976 skb_trim -10344 125 f, 462 +, 10806 -, diff: -10344 pskb_may_pull -10061 300 f, 1163 +, 11224 -, diff: -10061 try_module_get -10008 75 f, 341 +, 10349 -, diff: -10008 nlmsg_put ~250 are in 1000+ bytes category and ~440 in 500+. Full list has some entries without number because given config doesn't build them, and therefore nothing got uninlined, and the missing entries consists solely of compile failures, available here: http://www.cs.helsinki.fi/u/ijjarvin/inlines/sorted I made some patches to uninline couple of them (picked mostly net related) to stir up some discussion, however, some of them are not ready for inclusion as is (see patch descriptions). The cases don't represent all top 8 cases because some of the cases require a bit more analysis (e.g., config dependant, comments about gcc optimizations). The tools I used are available here except the site-specific distribute machinery (in addition one needs pretty late codiff from Arnaldo's toolset because there were some inline related bugs fixed lately): http://www.cs.helsinki.fi/u/ijjarvin/inline-tools.git/ I'm planning to run similar tests also on inlines in headers that are not under include/ but it requires minor modifications to those tools. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/8]: uninline uninline
Hi all, I run some lengthy tests to measure cost of inlines in headers under include/, simple coverage calculations yields to 89% but most of the failed compiles are due to preprocessor cutting the tested block away anyway. Test setup: v2.6.24-mm1, make allyesconfig, 32-bit x86, gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Because one inline was tested (function uninlined) at a time, the actual benefits of removing multiple inlines may well be below what the sum of those individually is (especially when something calls __-func with equal name). Ok, here's the top of the list (1+ bytes): -110805 869 f, 198 +, 111003 -, diff: -110805 skb_put -41525 2066 f, 3370 +, 44895 -, diff: -41525 IS_ERR -36290 42 f, 197 +, 36487 -, diff: -36290 cfi_build_cmd -35698 1234 f, 2391 +, 38089 -, diff: -35698 atomic_dec_and_test -28162 354 f, 3005 +, 31167 -, diff: -28162 skb_pull -23668 392 f, 104 +, 23772 -, diff: -23668 dev_alloc_skb -22212 415 f, 130 +, 22342 -, diff: -22212 __dev_alloc_skb -21593 356 f, 2418 +, 24011 -, diff: -21593 skb_push -19036 478 f, 259 +, 19295 -, diff: -19036 netif_wake_queue -18409 396 f, 6447 +, 24856 -, diff: -18409 __skb_pull -16420 187 f, 103 +, 16523 -, diff: -16420 dst_release -16025 13 f, 280 +, 16305 -, diff: -16025 cfi_send_gen_cmd -14925 486 f, 978 +, 15903 -, diff: -14925 add_timer -14896 199 f, 558 +, 15454 -, diff: -14896 sg_page -12870 36 f, 121 +, 12991 -, diff: -12870 le_key_k_type -12310 692 f, 7215 +, 19525 -, diff: -12310 signal_pending -11640 251 f, 118 +, 11758 -, diff: -11640 __skb_trim -11059 111 f, 293 +, 11352 -, diff: -11059 __nlmsg_put -10976 209 f, 123 +, 11099 -, diff: -10976 skb_trim -10344 125 f, 462 +, 10806 -, diff: -10344 pskb_may_pull -10061 300 f, 1163 +, 11224 -, diff: -10061 try_module_get -10008 75 f, 341 +, 10349 -, diff: -10008 nlmsg_put ~250 are in 1000+ bytes category and ~440 in 500+. Full list has some entries without number because given config doesn't build them, and therefore nothing got uninlined, and the missing entries consists solely of compile failures, available here: http://www.cs.helsinki.fi/u/ijjarvin/inlines/sorted I made some patches to uninline couple of them (picked mostly net related) to stir up some discussion, however, some of them are not ready for inclusion as is (see patch descriptions). The cases don't represent all top 8 cases because some of the cases require a bit more analysis (e.g., config dependant, comments about gcc optimizations). The tools I used are available here except the site-specific distribute machinery (in addition one needs pretty late codiff from Arnaldo's toolset because there were some inline related bugs fixed lately): http://www.cs.helsinki.fi/u/ijjarvin/inline-tools.git/ I'm planning to run similar tests also on inlines in headers that are not under include/ but it requires minor modifications to those tools. -- i. ps. I'm sorry about the duplicates, old git-send-email's 8-bit-header problem bit me again. :-( -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot
~500 files changed ... kernel/uninlined.c: skb_put | +104 1 function changed, 104 bytes added, diff: +104 vmlinux.o: 869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805 This change is INCOMPLETE, I think that the call to current_text_addr() should be rethought but I don't have a clue how to do that. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/skbuff.h | 20 +--- net/core/skbuff.c | 21 + 2 files changed, 22 insertions(+), 19 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 412672a..5925435 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -896,25 +896,7 @@ static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len) return tmp; } -/** - * skb_put - add data to a buffer - * @skb: buffer to use - * @len: amount of data to add - * - * This function extends the used data area of the buffer. If this would - * exceed the total buffer size the kernel will panic. A pointer to the - * first byte of the extra data is returned. - */ -static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len) -{ - unsigned char *tmp = skb_tail_pointer(skb); - SKB_LINEAR_ASSERT(skb); - skb-tail += len; - skb-len += len; - if (unlikely(skb-tail skb-end)) - skb_over_panic(skb, len, current_text_addr()); - return tmp; -} +extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len); static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len) { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4e35422..661d439 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -857,6 +857,27 @@ free_skb: return err; } +/** + * skb_put - add data to a buffer + * @skb: buffer to use + * @len: amount of data to add + * + * This function extends the used data area of the buffer. If this would + * exceed the total buffer size the kernel will panic. A pointer to the + * first byte of the extra data is returned. + */ +unsigned char *skb_put(struct sk_buff *skb, unsigned int len) +{ + unsigned char *tmp = skb_tail_pointer(skb); + SKB_LINEAR_ASSERT(skb); + skb-tail += len; + skb-len += len; + if (unlikely(skb-tail skb-end)) + skb_over_panic(skb, len, current_text_addr()); + return tmp; +} +EXPORT_SYMBOL(skb_put); + /* Trims skb to length len. It can change skb pointers. */ -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 4/8] [NET]: uninline skb_push, de-bloats a lot
-21593 356 funcs, 2418 +, 24011 -, diff: -21593 --- skb_push Again, current_text_addr() needs to addressed. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/skbuff.h | 18 +- net/core/skbuff.c | 19 +++ 2 files changed, 20 insertions(+), 17 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index df3cce2..c11f248 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -905,23 +905,7 @@ static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len) return skb-data; } -/** - * skb_push - add data to the start of a buffer - * @skb: buffer to use - * @len: amount of data to add - * - * This function extends the used data area of the buffer at the buffer - * start. If this would exceed the total buffer headroom the kernel will - * panic. A pointer to the first byte of the extra data is returned. - */ -static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len) -{ - skb-data -= len; - skb-len += len; - if (unlikely(skb-dataskb-head)) - skb_under_panic(skb, len, current_text_addr()); - return skb-data; -} +extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len); static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len) { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 081bffb..05d43fd 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -897,6 +897,25 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int len) EXPORT_SYMBOL(skb_put); /** + * skb_push - add data to the start of a buffer + * @skb: buffer to use + * @len: amount of data to add + * + * This function extends the used data area of the buffer at the buffer + * start. If this would exceed the total buffer headroom the kernel will + * panic. A pointer to the first byte of the extra data is returned. + */ +unsigned char *skb_push(struct sk_buff *skb, unsigned int len) +{ + skb-data -= len; + skb-len += len; + if (unlikely(skb-dataskb-head)) + skb_under_panic(skb, len, current_text_addr()); + return skb-data; +} +EXPORT_SYMBOL(skb_push); + +/** * skb_pull - remove data from the start of a buffer * @skb: buffer to use * @len: amount of data to remove -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot
-23668 392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/skbuff.h | 17 + net/core/skbuff.c | 18 ++ 2 files changed, 19 insertions(+), 16 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a9f8f15..df3cce2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1269,22 +1269,7 @@ static inline struct sk_buff *__dev_alloc_skb(unsigned int length, return skb; } -/** - * dev_alloc_skb - allocate an skbuff for receiving - * @length: length to allocate - * - * Allocate a new sk_buff and assign it a usage count of one. The - * buffer has unspecified headroom built in. Users should allocate - * the headroom they think they need without accounting for the - * built in space. The built in space is used for optimisations. - * - * %NULL is returned if there is no free memory. Although this function - * allocates memory it can be called from an interrupt. - */ -static inline struct sk_buff *dev_alloc_skb(unsigned int length) -{ - return __dev_alloc_skb(length, GFP_ATOMIC); -} +extern struct sk_buff *dev_alloc_skb(unsigned int length); extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length, gfp_t gfp_mask); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 14f462b..081bffb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, return skb; } +/** + * dev_alloc_skb - allocate an skbuff for receiving + * @length: length to allocate + * + * Allocate a new sk_buff and assign it a usage count of one. The + * buffer has unspecified headroom built in. Users should allocate + * the headroom they think they need without accounting for the + * built in space. The built in space is used for optimisations. + * + * %NULL is returned if there is no free memory. Although this function + * allocates memory it can be called from an interrupt. + */ +struct sk_buff *dev_alloc_skb(unsigned int length) +{ + return __dev_alloc_skb(length, GFP_ATOMIC); +} +EXPORT_SYMBOL(dev_alloc_skb); + static void skb_drop_list(struct sk_buff **listp) { struct sk_buff *list = *listp; -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/8] [NET]: uninline skb_pull, de-bloats a lot
-28162 354 funcs, 3005 +, 31167 -, diff: -28162 --- skb_pull Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/skbuff.h | 15 +-- net/core/skbuff.c | 16 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 5925435..a9f8f15 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -930,20 +930,7 @@ static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len) return skb-data += len; } -/** - * skb_pull - remove data from the start of a buffer - * @skb: buffer to use - * @len: amount of data to remove - * - * This function removes data from the start of a buffer, returning - * the memory to the headroom. A pointer to the next data in the buffer - * is returned. Once the data has been pulled future pushes will overwrite - * the old data. - */ -static inline unsigned char *skb_pull(struct sk_buff *skb, unsigned int len) -{ - return unlikely(len skb-len) ? NULL : __skb_pull(skb, len); -} +extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len); extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 661d439..14f462b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -878,6 +878,22 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int len) } EXPORT_SYMBOL(skb_put); +/** + * skb_pull - remove data from the start of a buffer + * @skb: buffer to use + * @len: amount of data to remove + * + * This function removes data from the start of a buffer, returning + * the memory to the headroom. A pointer to the next data in the buffer + * is returned. Once the data has been pulled future pushes will overwrite + * the old data. + */ +unsigned char *skb_pull(struct sk_buff *skb, unsigned int len) +{ + return unlikely(len skb-len) ? NULL : __skb_pull(skb, len); +} +EXPORT_SYMBOL(skb_pull); + /* Trims skb to length len. It can change skb pointers. */ -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 6/8] [NET]: uninline skb_trim, de-bloats
-10976 209 funcs, 123 +, 11099 -, diff: -10976 --- skb_trim Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/skbuff.h | 16 +--- net/core/skbuff.c | 16 2 files changed, 17 insertions(+), 15 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c11f248..75d8a66 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1156,21 +1156,7 @@ static inline void __skb_trim(struct sk_buff *skb, unsigned int len) skb_set_tail_pointer(skb, len); } -/** - * skb_trim - remove end from a buffer - * @skb: buffer to alter - * @len: new length - * - * Cut the length of a buffer down by removing data from the tail. If - * the buffer is already under the length specified it is not modified. - * The skb must be linear. - */ -static inline void skb_trim(struct sk_buff *skb, unsigned int len) -{ - if (skb-len len) - __skb_trim(skb, len); -} - +extern void skb_trim(struct sk_buff *skb, unsigned int len); static inline int __pskb_trim(struct sk_buff *skb, unsigned int len) { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 05d43fd..b57cadb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -931,6 +931,22 @@ unsigned char *skb_pull(struct sk_buff *skb, unsigned int len) } EXPORT_SYMBOL(skb_pull); +/** + * skb_trim - remove end from a buffer + * @skb: buffer to alter + * @len: new length + * + * Cut the length of a buffer down by removing data from the tail. If + * the buffer is already under the length specified it is not modified. + * The skb must be linear. + */ +void skb_trim(struct sk_buff *skb, unsigned int len) +{ + if (skb-len len) + __skb_trim(skb, len); +} +EXPORT_SYMBOL(skb_trim); + /* Trims skb to length len. It can change skb pointers. */ -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 7/8] [SCTP]: uninline sctp_add_cmd_sf
I added inline to sctp_add_cmd and appropriate comment there to avoid adding another call into the call chain. This works at least with gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Alternatively, __sctp_add_cmd could be introduced to .h. net/sctp/sm_statefuns.c: sctp_sf_cookie_wait_prm_abort | -125 sctp_sf_cookie_wait_prm_shutdown | -75 sctp_sf_do_9_1_prm_abort | -75 sctp_sf_shutdown_sent_prm_abort| -50 sctp_sf_pdiscard | -25 sctp_stop_t1_and_abort | -100 sctp_sf_do_9_2_start_shutdown | -154 __sctp_sf_do_9_1_abort | -50 sctp_send_stale_cookie_err | -29 sctp_sf_abort_violation| -181 sctp_sf_do_9_2_shutdown_ack| -154 sctp_sf_do_9_2_reshutack | -86 sctp_sf_tabort_8_4_8 | -28 sctp_sf_heartbeat | -52 sctp_sf_shut_8_4_5 | -27 sctp_eat_data | -246 sctp_sf_shutdown_sent_abort| -58 sctp_sf_check_restart_addrs| -50 sctp_sf_do_unexpected_init | -110 sctp_sf_sendbeat_8_3 | -107 sctp_sf_unk_chunk | -65 sctp_sf_do_prm_asoc| -129 sctp_sf_do_prm_send| -25 sctp_sf_do_9_2_prm_shutdown| -50 sctp_sf_error_closed | -25 sctp_sf_error_shutdown | -25 sctp_sf_shutdown_pending_prm_abort | -25 sctp_sf_do_prm_requestheartbeat| -28 sctp_sf_do_prm_asconf | -75 sctp_sf_do_6_3_3_rtx | -104 sctp_sf_do_6_2_sack| -25 sctp_sf_t1_init_timer_expire | -133 sctp_sf_t1_cookie_timer_expire | -104 sctp_sf_t2_timer_expire| -161 sctp_sf_t4_timer_expire| -175 sctp_sf_t5_timer_expire| -75 sctp_sf_autoclose_timer_expire | -50 sctp_sf_do_5_2_4_dupcook | -579 sctp_sf_do_4_C | -125 sctp_sf_shutdown_pending_abort | -32 sctp_sf_do_5_1E_ca | -186 sctp_sf_backbeat_8_3 | -27 sctp_sf_cookie_echoed_err | -300 sctp_sf_eat_data_6_2 | -146 sctp_sf_eat_data_fast_4_4 | -125 sctp_sf_eat_sack_6_2 | -29 sctp_sf_operr_notify | -25 sctp_sf_do_9_2_final | -152 sctp_sf_do_asconf | -64 sctp_sf_do_asconf_ack | -284 sctp_sf_eat_fwd_tsn_fast | -160 sctp_sf_eat_auth | -86 sctp_sf_do_5_1B_init | -110 sctp_sf_do_5_1C_ack| -204 sctp_sf_do_9_2_shutdown| -78 sctp_sf_do_ecn_cwr | -24 sctp_sf_do_ecne| -32 sctp_sf_eat_fwd_tsn| -135 sctp_sf_do_5_1D_ce | -197 sctp_sf_beat_8_3 | -28 60 functions changed, 6184 bytes removed, diff: -6184 net/sctp/sm_sideeffect.c: sctp_side_effects | -3873 sctp_do_sm| +3429 2 functions changed, 3429 bytes added, 3873 bytes removed, diff: -444 kernel/uninlined.c: sctp_add_cmd_sf | +35 1 function changed, 35 bytes added, diff: +35 vmlinux.o: 63 functions changed, 3464 bytes added, 10057 bytes removed, diff: -6593 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Cc: Vlad Yasevich [EMAIL PROTECTED] --- include/net/sctp/sm.h |8 ++-- net/sctp/command.c| 12 +++- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h index ef9e7ed..6740b11 100644 --- a/include/net/sctp/sm.h +++ b/include/net/sctp/sm.h @@ -385,13 +385,9 @@ static inline int ADDIP_SERIAL_gte(__u16 s, __u16 t) return (((s) == (t)) || (((t) - (s)) ADDIP_SERIAL_SIGN_BIT)); } - /* Run sctp_add_cmd() generating a BUG() if there is a failure. */ -static inline void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj) -{ - if (unlikely(!sctp_add_cmd(seq, verb, obj))) - BUG(); -} +extern void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb, + sctp_arg_t obj); /* Check VTAG of the packet matches the sender's own tag. */ static inline int diff --git a/net/sctp/command.c b/net/sctp/command.c index bb97733..187da2d 100644 --- a/net/sctp/command.c +++ b/net/sctp/command.c @@ -51,8 +51,11 @@ int sctp_init_cmd_seq(sctp_cmd_seq_t *seq) /* Add a command to a sctp_cmd_seq_t. * Return 0 if the command sequence is full. + * + * Inline here is not a mistake, this way sctp_add_cmd_sf doesn't need extra + * calls, size penalty is of insignificant magnitude here */ -int sctp_add_cmd(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj) +inline int sctp_add_cmd(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj) { if (seq-next_free_slot = SCTP_MAX_NUM_COMMANDS) goto fail; @@ -66,6 +69,13 @@ fail: return 0; } +/* Run
[RFC PATCH 8/8] Jhash in too big for inlining, move under lib/
vmlinux.o: 62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869 ...+ these to lib/jhash.o: jhash_3words: 112 jhash2: 276 jhash: 475 select for networking code might need a more fine-grained approach. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- drivers/infiniband/Kconfig |1 + drivers/net/Kconfig|1 + fs/Kconfig |1 + fs/dlm/Kconfig |1 + fs/gfs2/Kconfig|1 + include/linux/jhash.h | 99 + init/Kconfig |2 + lib/Kconfig|6 ++ lib/Makefile |1 + lib/jhash.c| 116 net/Kconfig|1 + 11 files changed, 134 insertions(+), 96 deletions(-) create mode 100644 lib/jhash.c diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index a5dc78a..421ab71 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -2,6 +2,7 @@ menuconfig INFINIBAND tristate InfiniBand support depends on PCI || BROKEN depends on HAS_IOMEM + select JHASH ---help--- Core support for InfiniBand (IB). Make sure to also select any protocols you wish to use as well as drivers for your diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index f337800..8257648 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2496,6 +2496,7 @@ config CHELSIO_T3 tristate Chelsio Communications T3 10Gb Ethernet support depends on PCI select FW_LOADER + select JHASH help This driver supports Chelsio T3-based gigabit and 10Gb Ethernet adapters. diff --git a/fs/Kconfig b/fs/Kconfig index d731282..693fe71 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -1667,6 +1667,7 @@ config NFSD select LOCKD select SUNRPC select EXPORTFS + select JHASH select NFSD_V2_ACL if NFSD_V3_ACL select NFS_ACL_SUPPORT if NFSD_V2_ACL select NFSD_TCP if NFSD_V4 diff --git a/fs/dlm/Kconfig b/fs/dlm/Kconfig index 2dbb422..f27a99a 100644 --- a/fs/dlm/Kconfig +++ b/fs/dlm/Kconfig @@ -4,6 +4,7 @@ menuconfig DLM depends on SYSFS (IPV6 || IPV6=n) select CONFIGFS_FS select IP_SCTP + select JHASH help A general purpose distributed lock manager for kernel or userspace applications. diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig index de8e64c..b9dcabf 100644 --- a/fs/gfs2/Kconfig +++ b/fs/gfs2/Kconfig @@ -3,6 +3,7 @@ config GFS2_FS depends on EXPERIMENTAL select FS_POSIX_ACL select CRC32 + select JHASH help A cluster filesystem. diff --git a/include/linux/jhash.h b/include/linux/jhash.h index 2a2f99f..14200c6 100644 --- a/include/linux/jhash.h +++ b/include/linux/jhash.h @@ -1,25 +1,6 @@ #ifndef _LINUX_JHASH_H #define _LINUX_JHASH_H -/* jhash.h: Jenkins hash support. - * - * Copyright (C) 1996 Bob Jenkins ([EMAIL PROTECTED]) - * - * http://burtleburtle.net/bob/hash/ - * - * These are the credits from Bob's sources: - * - * lookup2.c, by Bob Jenkins, December 1996, Public Domain. - * hash(), hash2(), hash3, and mix() are externally useful functions. - * Routines to test the hash are included if SELF_TEST is defined. - * You can use this free for any purpose. It has no warranty. - * - * Copyright (C) 2003 David S. Miller ([EMAIL PROTECTED]) - * - * I've modified Bob's hash to be useful in the Linux kernel, and - * any bugs present are surely my fault. -DaveM - */ - /* NOTE: Arguments are modified. */ #define __jhash_mix(a, b, c) \ { \ @@ -41,77 +22,12 @@ * of bytes. No alignment or length assumptions are made about * the input key. */ -static inline u32 jhash(const void *key, u32 length, u32 initval) -{ - u32 a, b, c, len; - const u8 *k = key; - - len = length; - a = b = JHASH_GOLDEN_RATIO; - c = initval; - - while (len = 12) { - a += (k[0] +((u32)k[1]8) +((u32)k[2]16) +((u32)k[3]24)); - b += (k[4] +((u32)k[5]8) +((u32)k[6]16) +((u32)k[7]24)); - c += (k[8] +((u32)k[9]8) +((u32)k[10]16)+((u32)k[11]24)); - - __jhash_mix(a,b,c); - - k += 12; - len -= 12; - } - - c += length; - switch (len) { - case 11: c += ((u32)k[10]24); - case 10: c += ((u32)k[9]16); - case 9 : c += ((u32)k[8]8); - case 8 : b += ((u32)k[7]24); - case 7 : b += ((u32)k[6]16); - case 6 : b += ((u32)k[5]8); - case 5 : b += k[4]; - case 4 : a += ((u32)k[3]24); - case 3 : a += ((u32)k[2]16); - case 2 : a += ((u32)k[1]8); - case 1 : a += k[0]; - }; - - __jhash_mix(a,b,c); - - return c; -} +extern u32 jhash(const void *key, u32 length, u32 initval); /* A special optimized version that handles 1 or more of u32s
[RFC PATCH 5/8] [NET]: uninline dst_release
Codiff stats: -16420 187 funcs, 103 +, 16523 -, diff: -16420 --- dst_release Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/net/dst.h | 10 +- net/core/dst.c| 10 ++ 2 files changed, 11 insertions(+), 9 deletions(-) diff --git a/include/net/dst.h b/include/net/dst.h index e3ac7d0..bf33471 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -158,15 +158,7 @@ struct dst_entry * dst_clone(struct dst_entry * dst) return dst; } -static inline -void dst_release(struct dst_entry * dst) -{ - if (dst) { - WARN_ON(atomic_read(dst-__refcnt) 1); - smp_mb__before_atomic_dec(); - atomic_dec(dst-__refcnt); - } -} +extern void dst_release(struct dst_entry *dst); /* Children define the path of the packet through the * Linux networking. Thus, destinations are stackable. diff --git a/net/core/dst.c b/net/core/dst.c index 7deef48..cc2e724 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -259,6 +259,16 @@ again: return NULL; } +void dst_release(struct dst_entry *dst) +{ + if (dst) { + WARN_ON(atomic_read(dst-__refcnt) 1); + smp_mb__before_atomic_dec(); + atomic_dec(dst-__refcnt); + } +} +EXPORT_SYMBOL(dst_release); + /* Dirty hack. We did it in 2.2 (in __dst_free), * we have _very_ good reasons not to repeat * this mistake in 2.3, but we have no choice -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot
On Wed, 20 Feb 2008, Jan Engelhardt wrote: On Feb 20 2008 17:27, Patrick McHardy wrote: Striking. How can this even happen? A callsite which calls dev_alloc_skb(n) is just equivalent to __dev_alloc_skb(n, GFP_ATOMIC); which means there's like 4 (or 8 if it's long) bytes more on the stack. For a worst case, count in another 8 bytes for push and pop or mov on the stack. But that still does not add up to 23 kb. I think you misunderstood the results, if I uninlined dev_alloc_skb(), it _alone_ was uninlined which basically means that __dev_alloc_skb() that is inline as well is included inside that uninlined function. When both were inlined, they add up to everywhere, and uninlining dev_alloc_skb alone mitigates that for both(!) of them in every place where dev_alloc_skb is being called. Because __dev_alloc_skb call sites are few, most benefits show up already with dev_alloc_skb uninlining alone. On the other hand, if __dev_alloc_skb is uninlined, the size reasoning you used above applies to dev_alloc_skb callsites, and that is definately less than 23kB. __dev_alloc_skb() is also an inline function which performs some extra work. Which raises the question - if dev_alloc_skb() is uninlined, shouldn't __dev_alloc_skb() be uninline as well? Of course that could be done as well, however, I wouldn't be too keen to deepen callchain by both of them, ie., uninlined dev_alloc_skb would just contain few bytes which perform the call to __dev_alloc_skb which has the bit larger content due to that extra work. IMHO the best solution would duplicate the extra work to both of them on binary level (obviously not on the source level), e.g., by adding static inline ___dev_alloc_skb() to .h which is inlined to both of the variants. I'm not too sure if inline to __dev_alloc_skb() alone is enough in .c file to result in inlining of __dev_alloc_skb to dev_alloc_skb (with all gcc versions and relevant optimization settings). I'd like to see the results when {__dev_alloc_skb is externed and dev_alloc_skb remains inlined}. The results are right under your nose already... ;-) See from the list of the series introduction: http://marc.info/?l=linux-netdevm=120351526210711w=2 IMHO more interesting number (which I currently don't have) is the _remaining_ benefits of uninlining __dev_alloc_skb after dev_alloc_skb was first uninlined. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 7/8] [SCTP]: uninline sctp_add_cmd_sf
On Wed, 20 Feb 2008, Vlad Yasevich wrote: Ilpo Järvinen wrote: I added inline to sctp_add_cmd and appropriate comment there to avoid adding another call into the call chain. This works at least with gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Alternatively, __sctp_add_cmd could be introduced to .h. My only concern was performance regressions, but it looks like it doesn't effect anything from the few quick runs I've made. There was one call made anyway, it's a bit hard to see how it would hurt to push that BUG() deeper down (in fact, this is one of the easiest case in this respect, many other cases elsewhere that need uninlining don't currently make any calls with inlines). Since we are putting sctp_add_cmd_sf() on the call stack, we might as well get rid of sctp_add_cmd() and reduce it a bit more. IMHO it is definately better solution for archiving the size reduction, I just didn't know before that the only sctp_add_cmd call could be converted. [...snip...] diff --git a/net/sctp/command.c b/net/sctp/command.c index bb97733..3a06513 100644 --- a/net/sctp/command.c +++ b/net/sctp/command.c @@ -51,19 +51,16 @@ int sctp_init_cmd_seq(sctp_cmd_seq_t *seq) /* Add a command to a sctp_cmd_seq_t. * Return 0 if the command sequence is full. + * + * Inline here is not a mistake, this way sctp_add_cmd_sf doesn't need extra + * calls, size penalty is of insignificant magnitude here This won't be a necessary note anymore. :-) [...snip...] -- i.
[PATCH] [XFRM] BEET: Remove extra semicolon after if
Once again, one of this kind tries to creep in. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/xfrm4_mode_beet.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/xfrm4_mode_beet.c b/net/ipv4/xfrm4_mode_beet.c index e093a7b..b47030b 100644 --- a/net/ipv4/xfrm4_mode_beet.c +++ b/net/ipv4/xfrm4_mode_beet.c @@ -102,7 +102,7 @@ static int xfrm4_beet_input(struct xfrm_state *x, struct sk_buff *skb) XFRM_MODE_SKB_CB(skb)-protocol = ph-nexthdr; - if (!pskb_may_pull(skb, phlen)); + if (!pskb_may_pull(skb, phlen)) goto out; __skb_pull(skb, phlen); } -- 1.5.2.2
[RFC PATCH]: TCP + SACK + skb processing latency
Hi all, Here's an attempt to reduce amount of skb cleanup work TCP with TSO has to do after SACKs have arrived. I'm not on very familiar grounds with TSOed skbs so there likely is much I just couldn't take into account. I probably should at least check some flag somewhere? (=NETIF_F_SG?). Also the current the length handling and potential presence of headers worries me a lot. I'd appreciate if somebody with more knowledge about skb internals could take a look at the relevant parts (mainly skb_shift) and confirm that doing all this I'm trying to do is allowed :-). -- i. -- [RFC PATCH] [TCP]: Try to restore large SKBs while SACK processing During SACK processing, most of the benefits of TSO are eaten by the SACK blocks that one-by-one fragment SKBs to MSS sized chunks. Then we're in problems when cleanup work for them has to be done when a large cumulative ACK comes. Try to return to pre-split state while more and more SACK info gets discovered. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/net/tcp.h|5 + net/ipv4/tcp_input.c | 206 +- 2 files changed, 209 insertions(+), 2 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..cdf4468 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1170,6 +1170,11 @@ static inline struct sk_buff *tcp_write_queue_next(struct sock *sk, struct sk_bu return skb-next; } +static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_buff *skb) +{ + return skb-prev; +} + #define tcp_for_write_queue(skb, sk) \ for (skb = (sk)-sk_write_queue.next; \ (skb != (struct sk_buff *)(sk)-sk_write_queue); \ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 19c449f..2cb7e86 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1206,6 +1206,8 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, * aligned portion of it that matches. Therefore we might need to fragment * which may fail and creates some hassle (caller must handle error case * returns). + * + * FIXME: this could be merged to shift decision code */ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, u32 start_seq, u32 end_seq) @@ -1322,6 +1324,206 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, return flag; } +/* Attempts to shift up to shiftlen worth of bytes from prev to skb. + * Returns number bytes shifted. + * + * TODO: in case the prev runs out of frag space, operation could be + * made to return with a partial result (would allow tighter packing). + */ +static int skb_shift(struct sk_buff *prev, struct sk_buff *skb, +unsigned int shiftlen) +{ + int i, to, merge; + unsigned int todo; + struct skb_frag_struct *from, *fragto; + + if (skb_cloned(skb) || skb_cloned(prev)) + return 0; + + todo = shiftlen; + i = 0; + from = skb_shinfo(skb)-frags[i]; + to = skb_shinfo(prev)-nr_frags; + + merge = to - 1; + if (!skb_can_coalesce(prev, merge + 1, from-page, from-page_offset)) + merge = -1; + if (merge = 0) { + i++; + if (from-size = shiftlen) + goto onlymerge; + todo -= from-size; + } + + /* Skip full, not-fitting skb to avoid expensive operations */ + if ((shiftlen == skb-len) + (skb_shinfo(skb)-nr_frags - merge) (MAX_SKB_FRAGS - to)) + return 0; + + while (todo (i skb_shinfo(skb)-nr_frags)) { + if (to == MAX_SKB_FRAGS) + return 0; + + from = skb_shinfo(skb)-frags[i]; + fragto = skb_shinfo(prev)-frags[to]; + + if (todo = from-size) { + *fragto = *from; + todo -= from-size; + i++; + to++; + + } else { + fragto-page = from-page; + get_page(fragto-page); + fragto-page_offset = from-page_offset; + fragto-size = todo; + from-page_offset += todo; + from-size -= todo; + + to++; + break; + } + } + skb_shinfo(prev)-nr_frags = to; + + /* Delayed, so that we don't have to undo it */ + if (merge = 0) { +onlymerge: + from = skb_shinfo(skb)-frags[0]; + skb_shinfo(prev)-frags[merge].size += min(from-size, shiftlen); + put_page(from-page); + } + + /* Reposition in the original skb */ + to = 0; + while (i skb_shinfo(skb)-nr_frags) + skb_shinfo(skb)-frags[to++] = skb_shinfo(skb
Re: [RFC PATCH]: TCP + SACK + skb processing latency
On Fri, 1 Feb 2008, Ilpo Järvinen wrote: @@ -1322,6 +1324,206 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, return flag; } +/* Attempts to shift up to shiftlen worth of bytes from prev to skb. + * Returns number bytes shifted. + * + * TODO: in case the prev runs out of frag space, operation could be + * made to return with a partial result (would allow tighter packing). + */ +static int skb_shift(struct sk_buff *prev, struct sk_buff *skb, + unsigned int shiftlen) +{ + int i, to, merge; + unsigned int todo; + struct skb_frag_struct *from, *fragto; + + if (skb_cloned(skb) || skb_cloned(prev)) + return 0; + + todo = shiftlen; + i = 0; + from = skb_shinfo(skb)-frags[i]; + to = skb_shinfo(prev)-nr_frags; + + merge = to - 1; + if (!skb_can_coalesce(prev, merge + 1, from-page, from-page_offset)) + merge = -1; + if (merge = 0) { + i++; + if (from-size = shiftlen) + goto onlymerge; + todo -= from-size; + } + + /* Skip full, not-fitting skb to avoid expensive operations */ + if ((shiftlen == skb-len) + (skb_shinfo(skb)-nr_frags - merge) (MAX_SKB_FRAGS - to)) Before somebody else notices it: s/merge/i/, a leftover from flag - idx conversion. + return 0; + + while (todo (i skb_shinfo(skb)-nr_frags)) { + if (to == MAX_SKB_FRAGS) + return 0; + -- i.
Re: [2.6 patch] unexport sysctl_tcp_tso_win_divisor
On Wed, 30 Jan 2008, Adrian Bunk wrote: This patch removes the no longer used EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor). Signed-off-by: Adrian Bunk [EMAIL PROTECTED] --- 4884e7997ba5f63f2efeaeead21ed2768fb3f4de diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 89f0188..ed750f9 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2564,5 +2564,4 @@ EXPORT_SYMBOL(tcp_connect); EXPORT_SYMBOL(tcp_make_synack); EXPORT_SYMBOL(tcp_simple_retransmit); EXPORT_SYMBOL(tcp_sync_mss); -EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor); EXPORT_SYMBOL(tcp_mtup_init); ...yes, results from the recent move of tcp_is_cwnd_limited() away from tcp.h. Acked-by: Ilpo Järvinen [EMAIL PROTECTED] -- i.
[PATCH] [TIPC] Kill unused static inline (x5)
From: =?ISO-8859-1?q?Ilpo=20J=E4rvinen?= [EMAIL PROTECTED] All these static inlines are unused: in_own_zone 1 (net/tipc/addr.h) msg_dataoctet 1 (net/tipc/msg.h) msg_direct 1 (include/net/tipc/tipc_msg.h) msg_options 1 (include/net/tipc/tipc_msg.h) tipc_nmap_get 1 (net/tipc/bcast.h) Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/net/tipc/tipc_msg.h | 16 net/tipc/addr.h |5 - net/tipc/bcast.h| 13 - net/tipc/msg.h |5 - 4 files changed, 0 insertions(+), 39 deletions(-) diff --git a/include/net/tipc/tipc_msg.h b/include/net/tipc/tipc_msg.h index fb42eb7..2e159a8 100644 --- a/include/net/tipc/tipc_msg.h +++ b/include/net/tipc/tipc_msg.h @@ -130,11 +130,6 @@ static inline u32 msg_type(struct tipc_msg *m) return msg_bits(m, 1, 29, 0x7); } -static inline u32 msg_direct(struct tipc_msg *m) -{ - return (msg_type(m) == TIPC_DIRECT_MSG); -} - static inline u32 msg_named(struct tipc_msg *m) { return (msg_type(m) == TIPC_NAMED_MSG); @@ -207,17 +202,6 @@ static inline u32 msg_nameupper(struct tipc_msg *m) return msg_word(m, 10); } -static inline char *msg_options(struct tipc_msg *m, u32 *len) -{ - u32 pos = msg_bits(m, 1, 16, 0x7); - - if (!pos) - return 0; - pos = (pos * 4) + 28; - *len = msg_hdr_sz(m) - pos; - return (char *)m-hdr[pos/4]; -} - #endif #endif diff --git a/net/tipc/addr.h b/net/tipc/addr.h index e4bd533..3ba67e6 100644 --- a/net/tipc/addr.h +++ b/net/tipc/addr.h @@ -57,11 +57,6 @@ static inline int in_own_cluster(u32 addr) return !((addr ^ tipc_own_addr) 12); } -static inline int in_own_zone(u32 addr) -{ - return !((addr ^ tipc_own_addr) 24); -} - static inline int is_slave(u32 addr) { return addr 0x800; diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index f910ed2..a2416fa 100644 --- a/net/tipc/bcast.h +++ b/net/tipc/bcast.h @@ -74,19 +74,6 @@ extern char tipc_bclink_name[]; /** - * nmap_get - determine if node exists in a node map - */ - -static inline int tipc_nmap_get(struct node_map *nm_ptr, u32 node) -{ - int n = tipc_node(node); - int w = n / WSIZE; - int b = n % WSIZE; - - return nm_ptr-map[w] (1 b); -} - -/** * nmap_add - add a node to a node map */ diff --git a/net/tipc/msg.h b/net/tipc/msg.h index ce26598..e9ef6df 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -663,11 +663,6 @@ static inline void msg_set_remote_node(struct tipc_msg *m, u32 a) msg_set_word(m, msg_hdr_sz(m)/4, a); } -static inline int msg_dataoctet(struct tipc_msg *m, u32 pos) -{ - return(msg_data(m)[pos + 4] != 0); -} - static inline void msg_set_dataoctet(struct tipc_msg *m, u32 pos) { msg_data(m)[pos + 4] = 1; -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
[c0155659] ? trace_hardirqs_on+0xb9/0x130 [c01059ca] common_interrupt+0x2e/0x34 [c0103390] ? mwait_idle_with_hints+0x40/0x50 [c01033a0] ? mwait_idle+0x0/0x20 [c01033b2] mwait_idle+0x12/0x20 [c0103141] cpu_idle+0x61/0x110 [c04339fd] rest_init+0x5d/0x60 [c05a47fa] start_kernel+0x1fa/0x260 [c05a4190] ? unknown_bootoption+0x0/0x130 === ---[ end trace 14b601818e6903ac ]--- ...But this no longer is, and even more, L: 5 is not valid state at this point all (should only happen if we went to RTO but it would reset S to zero with newreno): P: 5 L: 5 vs 5 S: 0 vs 3 w: 2044790889-2044796616 (0) TCP wq(s) l TCP wq(h) +++h+ l5 s3 f0 p5 seq: su2044790889 hs2044795029 sn2044796616 Surprisingly, it was the first time the WARN_ON for left_out returned correct location. This also explains why the patch I sent to Krishna didn't print anything (it didn't end up into printing because I forgot to add L+SP check into to the state checking if). ...so please, could you (others than Denys) try this patch, it should solve the issue. And Denys, could you confirm (and if necessary double check) that the kernel you saw this similar problem with is the pure Linus' mainline, i.e., without any net-2.6.25 or mm bits please, if so, that problem persists. And anyway, there were some fackets_out related problems reported as well and this doesn't help for that but I think I've lost track of who was seeing it due to large number of reports :-), could somebody refresh my memory because I currently don't have time to dig it up from archives (at least on this week). -- i. -- [PATCH] [TCP]: NewReno must count every skb while marking losses NewReno should add cnt per skb (as with FACK) instead of depending on SACKED_ACKED bits which won't be set with it at all. Effectively, NewReno should always exists after the first iteration anyway (or immediately if there's already head in lost_out. This was fixed earlier in net-2.6.25 but got reverted among other stuff and I didn't notice that this is still necessary (actually wasn't even considering this case while trying to figure out the reports because I lived with different kind of code than it in reality was). This should solve the WARN_ONs in TCP code that as a result of this triggered multiple times in every place we check for this invariant. Special thanks to Dave Young [EMAIL PROTECTED] and Krishna Kumar2 [EMAIL PROTECTED] for trying with my debug patches. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 295490e..aa409a5 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2156,7 +2156,7 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit) tp-lost_skb_hint = skb; tp-lost_cnt_hint = cnt; - if (tcp_is_fack(tp) || + if (tcp_is_fack(tp) || tcp_is_reno(tp) || (TCP_SKB_CB(skb)-sacked TCPCB_SACKED_ACKED)) cnt += tcp_skb_pcount(skb); -- 1.5.2.2
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Thu, 24 Jan 2008, Ilpo Järvinen wrote: And anyway, there were some fackets_out related problems reported as well and this doesn't help for that but I think I've lost track of who was seeing it due to large number of reports :-), could somebody refresh my memory because I currently don't have time to dig it up from archives (at least on this week). Here's the updated debug patch for net-2.6.25/mm for tracking fackets_out inconsistencies (it won't work for trees which don't include net-2.6.25, mm does of course :-)). I hope I got it into good shape this time to avoid spurious stacktraces but still maintaining 100% accuracy, but it's not a simple oneliner so I might have missed something... :-) I'd suggest that people trying with this first apply the newreno fix of the previous mail to avoid already-fixed case from triggering. -- i. -- [PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline) --- include/net/tcp.h |5 ++- net/ipv4/tcp_input.c | 18 +++- net/ipv4/tcp_ipv4.c | 127 + net/ipv4/tcp_output.c | 23 +++-- 4 files changed, 165 insertions(+), 8 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..552aa71 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,9 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern void tcp_print_queue(struct sock *sk); +extern voidtcp_verify_wq(struct sock *sk); + extern voidtcp_v4_err(struct sk_buff *skb, u32); extern voidtcp_shutdown (struct sock *sk, int how); @@ -768,7 +771,7 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) } /* Use define here intentionally to get WARN_ON location shown at the caller */ -#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp) tp-packets_out) +#define tcp_verify_left_out(tp)tcp_verify_wq((struct sock *)tp) extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 19c449f..c897c93 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1426,8 +1426,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, int first_sack_index; if (!tp-sacked_out) { - if (WARN_ON(tp-fackets_out)) + if (WARN_ON(tp-fackets_out)) { + tcp_verify_left_out(tp); tp-fackets_out = 0; + } tcp_highest_sack_reset(sk); } @@ -2136,6 +2138,8 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit) struct sk_buff *skb; int cnt; + tcp_verify_left_out(tp); + BUG_TRAP(packets = tp-packets_out); if (tp-lost_skb_hint) { skb = tp-lost_skb_hint; @@ -2501,6 +2505,8 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) (tcp_fackets_out(tp) tp-reordering)); int fast_rexmit = 0; + tcp_verify_left_out(tp); + if (WARN_ON(!tp-packets_out tp-sacked_out)) tp-sacked_out = 0; if (WARN_ON(!tp-sacked_out tp-fackets_out)) @@ -2645,6 +2651,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) if (do_lost || (tcp_is_fack(tp) tcp_head_timedout(sk))) tcp_update_scoreboard(sk, fast_rexmit); tcp_cwnd_down(sk, flag); + + WARN_ON(tcp_write_queue_head(sk) == NULL); + WARN_ON(!tp-packets_out); + tcp_xmit_retransmit_queue(sk); } @@ -2848,6 +2858,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + tcp_verify_left_out(tp); + if (skb (TCP_SKB_CB(skb)-sacked TCPCB_SACKED_ACKED)) flag |= FLAG_SACK_RENEGING; @@ -3175,6 +3187,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_fackets = tp-fackets_out; prior_in_flight = tcp_packets_in_flight(tp); + tcp_verify_left_out(tp); + if (!(flag FLAG_SLOWPATH) after(ack, prior_snd_una)) { /* Window is constant, pure forward advance. * No more checks are required. @@ -3237,6 +3251,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) if ((flag FLAG_FORWARD_PROGRESS) || !(flag FLAG_NOT_DUP)) dst_confirm(sk-sk_dst_cache); + tcp_verify_left_out(tp); + return 1; no_queue: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9aea88b..e6e3ad5 100644 --- a/net/ipv4
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Wed, 23 Jan 2008, Dave Young wrote: On Jan 23, 2008 3:41 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote: On Tue, 22 Jan 2008, David Miller wrote: From: Dave Young [EMAIL PROTECTED] Date: Wed, 23 Jan 2008 09:44:30 +0800 On Jan 22, 2008 6:47 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote: [PATCH] [TCP]: debug S+L Thanks, If there's new findings I will let you know. Thanks for helping with this bug Dave. I noticed btw that there thing might (is likely to) spuriously trigger at WARN_ON(sacked != tp-sacked_out); because those won't be equal when SACK is not enabled. If that does happen too often, I send a fixed patch for it, yet, the fact that I print print tp-rx_opt.sack_ok allows identification of those cases already as it's zero when SACK is not enabled. Just ask if you need the updated debug patch. Thanks, please send, I would like to get it. There you go. I fixed non-SACK case by adding tcp_is_sack checks there and also added two verifys to tcp_ack to see if there's corruption outside of TCP. -- i. [PATCH] [TCP]: debug S+L --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c | 10 + net/ipv4/tcp_ipv4.c | 101 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 133 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern voidtcp_verify_wq(struct sock *sk); + extern voidtcp_v4_err(struct sk_buff *skb, u32); extern voidtcp_shutdown (struct sock *sk, int how); @@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) } /* Use define here intentionally to get WARN_ON location shown at the caller */ -#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp) tp-packets_out) +#define tcp_verify_left_out(tp)\ + do { \ + WARN_ON(tcp_left_out(tp) tp-packets_out); \ + tcp_verify_wq((struct sock *)tp); \ + } while(0) extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index fa2c85c..cdacf70 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) if (do_lost || (tcp_is_fack(tp) tcp_head_timedout(sk))) tcp_update_scoreboard(sk, fast_rexmit); tcp_cwnd_down(sk, flag); + + WARN_ON(tcp_write_queue_head(sk) == NULL); + WARN_ON(!tp-packets_out); + tcp_xmit_retransmit_queue(sk); } @@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + tcp_verify_left_out(tp); + if (skb (TCP_SKB_CB(skb)-sacked TCPCB_SACKED_ACKED)) flag |= FLAG_SACK_RENEGING; @@ -3175,6 +3181,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_fackets = tp-fackets_out; prior_in_flight = tcp_packets_in_flight(tp); + tcp_verify_left_out(tp); + if (!(flag FLAG_SLOWPATH) after(ack, prior_snd_una)) { /* Window is constant, pure forward advance. * No more checks are required. @@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) if ((flag FLAG_FORWARD_PROGRESS) || !(flag FLAG_NOT_DUP)) dst_confirm(sk-sk_dst_cache); + tcp_verify_left_out(tp); + return 1; no_queue: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9aea88b..7e8ab40 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -108,6 +108,107 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = { .lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait), }; +void tcp_print_queue(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; + char s[50+1]; + char h[50+1]; + int idx = 0; + int i; + + tcp_for_write_queue(skb, sk) { + if (skb == tcp_send_head(sk)) + break; + + for (i = 0; i tcp_skb_pcount(skb); i++) { + if (TCP_SKB_CB(skb)-sacked TCPCB_SACKED_ACKED) { + s[idx] = 'S'; + if (TCP_SKB_CB(skb)-sacked TCPCB_LOST) + s[idx] = 'B'; + + } else if (TCP_SKB_CB(skb)-sacked
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: David Miller [EMAIL PROTECTED] wrote on 01/23/2008 01:27:23 PM: iperf with multiple threads almost always gets these 4, *especially* when I do some batching :). static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) { ... if (WARN_ON(!tp-sacked_out tp-fackets_out)) tp-fackets_out = 0; ... } Does this assertion show up first or do you get the other TCP ones first? It might be important, in that if you get the others ones first that corrupted state might be what leads to this one. Hi Dave, I looked at my *old* messages file and found this assert (2506) was first to hit (atleast in two messages file). It hit 5 times, then I got a different one that I had not reported earlier: KERNEL: assertion (packets = tp-packets_out) failed at net/ipv4/tcp_input.c (2139) (though this was hidden in my report under the panic for tcp_input.c:2528. Then another two thousand times of the 2506 asserts. Today I installed the latest untouched kernel, rebooted system and got the following errors in sequence, but no 2506 errors (which I have always got when running batching in the last 2-3 weeks): Jan 22 02:07:55 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:07:57 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:07:58 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:00 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:02 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:04 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:05 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:08 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:09 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:11 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:13 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:14 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 Jan 22 02:08:17 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 and so on for another 700 counts. The unique asserts are: 1767: tcp_verify_left_out (from tcp_entry_frto) 2169: tcp_verify_left_out (from tcp_mark_head_lost) 2528: tcp_verify_left_out (from tcp_fastretrans_alert) 3063: tcp_verify_left_out (from tcp_process_frto) (where 2169 seems to preceed any other asserts) Once you get one of these, you'll get a large number of them, maybe I should just change it to WARN_ON_ONCE to stop confusing people with the rest. The other two asserts that I got only with batching are: 2139: BUG_TRAP(packets = tp-packets_out); (in tcp_mark_head_lost) 2506: WARN_ON(!tp-sacked_out tp-fackets_out) (in tcp_fastretrans_alert) (where 2506 always seems to preceed any other asserts). It's almost impossible to know which of these is the main cause and the first occuring due to reasons I'll not copy here. What a strange thing that it has been super quiet on this front until now everybody is seeing it, could there be something unrelated to TCP which has broken it all recently? Good thing is that you seem to be able to reproduce it were nicely. Please try
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: Hi Ilpo, It's almost impossible to know which of these is the main cause and the first occuring due to reasons I'll not copy here. What a strange thing that it has been super quiet on this front until now everybody is seeing it, could there be something unrelated to TCP which has broken it all recently? I have been getting this for atleast 3 weeks but I was quiet since those were kernels that I had modified. Since you can easily reproduce it, lets just figure out what's causing it hard way, rather than digging the endless git-logs... :-) Good thing is that you seem to be able to reproduce it were nicely. Please try with this beauty below... Hopefully got it correctly matching to mainline, in contrast to version I sent Dave Y., there's some added candy which catches highest_sack corruption as well, at least it compiles already :-). There were couple of patch apply failures in .c files which I fixed by hand. But when compiling, I got these errors (I am using DM's 2.6.24-rc7 kernel, net-2.6.25.git): Well, that's annoying, you didn't mention net-2.6.25 back then, it sure is incompatible with it like already the patch title said... :-) net/ipv4/tcp_output.c: In function 'tcp_push_one': net/ipv4/tcp_output.c:1573: error: 'tp' undeclared (first use in this function) net/ipv4/tcp_output.c:1573: error: (Each undeclared identifier is reported only once net/ipv4/tcp_output.c:1573: error: for each function it appears in.) ...This is either due to mismerge or due your modifications. Lets re-iterate, compiled ok for me tcp_{ipv4,input,output}.o... -- i. [PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline) --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c | 10 + net/ipv4/tcp_ipv4.c | 107 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 139 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern voidtcp_verify_wq(struct sock *sk); + extern voidtcp_v4_err(struct sk_buff *skb, u32); extern voidtcp_shutdown (struct sock *sk, int how); @@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) } /* Use define here intentionally to get WARN_ON location shown at the caller */ -#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp) tp-packets_out) +#define tcp_verify_left_out(tp)\ + do { \ + WARN_ON(tcp_left_out(tp) tp-packets_out); \ + tcp_verify_wq((struct sock *)tp); \ + } while(0) extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index fa2c85c..cdacf70 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) if (do_lost || (tcp_is_fack(tp) tcp_head_timedout(sk))) tcp_update_scoreboard(sk, fast_rexmit); tcp_cwnd_down(sk, flag); + + WARN_ON(tcp_write_queue_head(sk) == NULL); + WARN_ON(!tp-packets_out); + tcp_xmit_retransmit_queue(sk); } @@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + tcp_verify_left_out(tp); + if (skb (TCP_SKB_CB(skb)-sacked TCPCB_SACKED_ACKED)) flag |= FLAG_SACK_RENEGING; @@ -3175,6 +3181,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_fackets = tp-fackets_out; prior_in_flight = tcp_packets_in_flight(tp); + tcp_verify_left_out(tp); + if (!(flag FLAG_SLOWPATH) after(ack, prior_snd_una)) { /* Window is constant, pure forward advance. * No more checks are required. @@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) if ((flag FLAG_FORWARD_PROGRESS) || !(flag FLAG_NOT_DUP)) dst_confirm(sk-sk_dst_cache); + tcp_verify_left_out(tp); + return 1; no_queue: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9aea88b..c95682e 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -108,6 +108,113 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = { .lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait), }; +void tcp_print_queue(struct sock *sk)
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Wed, 23 Jan 2008, Ilpo Järvinen wrote: On Wed, 23 Jan 2008, Dave Young wrote: On Jan 23, 2008 3:41 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote: On Tue, 22 Jan 2008, David Miller wrote: From: Dave Young [EMAIL PROTECTED] Date: Wed, 23 Jan 2008 09:44:30 +0800 On Jan 22, 2008 6:47 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote: [PATCH] [TCP]: debug S+L Thanks, If there's new findings I will let you know. Thanks for helping with this bug Dave. I noticed btw that there thing might (is likely to) spuriously trigger at WARN_ON(sacked != tp-sacked_out); because those won't be equal when SACK is not enabled. If that does happen too often, I send a fixed patch for it, yet, the fact that I print print tp-rx_opt.sack_ok allows identification of those cases already as it's zero when SACK is not enabled. Just ask if you need the updated debug patch. Thanks, please send, I would like to get it. There you go. I fixed non-SACK case by adding tcp_is_sack checks there and also added two verifys to tcp_ack to see if there's corruption outside of TCP. There's some discussion about a problem that is very likely the same as in here (sorry for not remembering to cc you in there due to rapid progress): http://marc.info/?t=12010717423r=1w=2 -- i.
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Ilpo Järvinen wrote: On Wed, 23 Jan 2008, Krishna Kumar2 wrote: Hi Ilpo, It's almost impossible to know which of these is the main cause and the first occuring due to reasons I'll not copy here. What a strange thing that it has been super quiet on this front until now everybody is seeing it, could there be something unrelated to TCP which has broken it all recently? I have been getting this for atleast 3 weeks but I was quiet since those were kernels that I had modified. Since you can easily reproduce it, lets just figure out what's causing it hard way, rather than digging the endless git-logs... :-) Hmm, perhaps it could be something related to this (and some untested path somewhere which is now exposed): commit 4a55b553f691abadaa63570dfc714e20913561c1 Author: Ilpo Järvinen [EMAIL PROTECTED] Date: Thu Dec 20 20:36:03 2007 -0800 [TCP]: Fix TSO deferring Dave, what do you think? Wouldn't explain the one -rc only report though from Denys. Another one I'm a bit unsure of is this: commit 757c32944b80fd95542bd66f06032ab773034d53 Author: Ilpo Järvinen [EMAIL PROTECTED] Date: Thu Jan 3 20:39:01 2008 -0800 [TCP]: Perform setting of common control fields in one place -sacked field is cleared in tcp_retransmit_skb due to a subtle change, which might be buggy However, I find it rather unlikely that this would explain Kumar's case. Anyway, here's the one that reverts the problematic part of it. ...and this is net-2.6.25 as well so it won't solve Denys' case either. -- i. -- [PATCH] [TCP]: Revert part of [TCP]: Perform setting of common control... This commit reverts part of 757c32944b80fd95542bd66f06032ab773034d53 ([TCP]: Perform setting of common control fields in one place) because it's not valid to clear -sacked field like that without additional precautions with counters, mostly lost_out, sacked_out should not be set due to reneging which kicks in much earlier than this. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_output.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 648340f..f6cbc1f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1887,10 +1887,12 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) (TCP_SKB_CB(skb)-flags TCPCB_FLAG_FIN) tp-snd_una == (TCP_SKB_CB(skb)-end_seq - 1)) { if (!pskb_trim(skb, 0)) { - /* Reuse, even though it does some unnecessary work */ - tcp_init_nondata_skb(skb, TCP_SKB_CB(skb)-end_seq - 1, -TCP_SKB_CB(skb)-flags); + TCP_SKB_CB(skb)-seq = TCP_SKB_CB(skb)-end_seq - 1; + skb_shinfo(skb)-gso_segs = 1; + skb_shinfo(skb)-gso_size = 0; + skb_shinfo(skb)-gso_type = 0; skb-ip_summed = CHECKSUM_NONE; + skb-csum = 0; } } -- 1.5.2.2
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: While running with this patch, I got these errors (pasted at the end of this mail). I don't have a clue why it didn't go to the checking func (or it didn't print anything) but just had those WARN_ONs... Hopefully this is giving somewhat better input (applies on top of the other debug patch). -- i. [PATCH] [TCP]: more debug --- include/net/tcp.h|3 ++- net/ipv4/tcp_input.c |9 - net/ipv4/tcp_ipv4.c | 19 ++- 3 files changed, 24 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 0685035..129c3b1 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,7 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern void tcp_print_queue(struct sock *sk); extern voidtcp_verify_wq(struct sock *sk); extern voidtcp_v4_err(struct sk_buff *skb, u32); @@ -772,8 +773,8 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) /* Use define here intentionally to get WARN_ON location shown at the caller */ #define tcp_verify_left_out(tp)\ do { \ - WARN_ON(tcp_left_out(tp) tp-packets_out); \ tcp_verify_wq((struct sock *)tp); \ + WARN_ON(tcp_left_out(tp) tp-packets_out); \ } while(0) extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index cdacf70..295490e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2133,12 +2133,15 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb) static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit) { struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; + struct sk_buff *skb, *prev = NULL; int cnt; + tcp_verify_left_out(tp); + BUG_TRAP(packets = tp-packets_out); if (tp-lost_skb_hint) { skb = tp-lost_skb_hint; + prev = skb; cnt = tp-lost_cnt_hint; } else { skb = tcp_write_queue_head(sk); @@ -2166,6 +2169,10 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit) tcp_verify_retransmit_hint(tp, skb); } } + if (tcp_left_out(tp) tp-packets_out) { + printk(KERN_ERR Prev hint: %p, exit %p\n, prev, skb); + tcp_print_queue(sk); + } tcp_verify_left_out(tp); } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c95682e..c2a88c5 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -117,6 +117,15 @@ void tcp_print_queue(struct sock *sk) int idx = 0; int i; + i = 0; + tcp_for_write_queue(skb, sk) { + if (skb == tcp_send_head(sk)) + printk(KERN_ERR head %u %p\n, i, skb); + else + printk(KERN_ERR skb %u %p\n, i, skb); + i++; + } + tcp_for_write_queue(skb, sk) { if (skb == tcp_send_head(sk)) break; @@ -195,11 +204,6 @@ void tcp_verify_wq(struct sock *sk) packets += tcp_skb_pcount(skb); } - WARN_ON(lost != tp-lost_out); - WARN_ON(tcp_is_sack(tp) (sacked != tp-sacked_out)); - WARN_ON(packets != tp-packets_out); - WARN_ON(fackets != tp-fackets_out); - if ((lost != tp-lost_out) || (tcp_is_sack(tp) (sacked != tp-sacked_out)) || (packets != tp-packets_out) || @@ -213,6 +217,11 @@ void tcp_verify_wq(struct sock *sk) tp-rx_opt.sack_ok); tcp_print_queue(sk); } + + WARN_ON(lost != tp-lost_out); + WARN_ON(tcp_is_sack(tp) (sacked != tp-sacked_out)); + WARN_ON(packets != tp-packets_out); + WARN_ON(fackets != tp-fackets_out); } static int tcp_v4_get_port(struct sock *sk, unsigned short snum) -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, Dave Young wrote: On Jan 22, 2008 12:37 PM, Dave Young [EMAIL PROTECTED] wrote: On Jan 22, 2008 5:14 AM, Ilpo Järvinen [EMAIL PROTECTED] wrote: On Mon, 21 Jan 2008, Dave Young wrote: Please see the kernel messages following,(trigged while using some qemu session) BTW, seems there's some e100 error message as well. PCI: Setting latency timer of device :00:1b.0 to 64 e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI e100: Copyright(c) 1999-2006 Intel Corporation ACPI: PCI Interrupt :03:08.0[A] - GSI 20 (level, low) - IRQ 20 modprobe:2331 conflicting cache attribute efaff000-efb0 uncached-default e100: :03:08.0: e100_probe: Cannot map device registers, aborting. ACPI: PCI interrupt for device :03:08.0 disabled e100: probe of :03:08.0 failed with error -12 eth0: setting full-duplex. [ cut here ] WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 [c0132100] ? printk+0x0/0x20 [c0131834] warn_on_slowpath+0x54/0x80 [c03e8df8] ? ip_finish_output+0x128/0x2e0 [c03e9527] ? ip_output+0xe7/0x100 [c03e8a88] ? ip_local_out+0x18/0x20 [c03e991c] ? ip_queue_xmit+0x3dc/0x470 [c043641e] ? _spin_unlock_irqrestore+0x5e/0x70 [c0186be1] ? check_pad_bytes+0x61/0x80 [c03f6031] tcp_mark_head_lost+0x121/0x150 [c03f60ac] tcp_update_scoreboard+0x4c/0x170 [c03f6e0a] tcp_fastretrans_alert+0x48a/0x6b0 [c03f7d93] tcp_ack+0x1b3/0x3a0 [c03fa14b] tcp_rcv_established+0x3eb/0x710 [c04015c5] tcp_v4_do_rcv+0xe5/0x100 [c0401bbb] tcp_v4_rcv+0x5db/0x660 Doh, once more these S+L things..., the rest are symptom of the first problem. What is the S+L thing? Could you explain a bit? It means that one of the skbs is both SACKed and marked as LOST at the same time in the counters (might be due to miscount of lost/sacked_out too, not necessarilily in the -sacked bits). Such state is logically invalid because it would mean that the sender thinks that the same packet both reached the receiver and is lost in the network. Traditionally TCP has just silently corrected over-estimates (sacked_out+lost_out packets_out). I changed this couple of releases ago because those over-estimates often are due to bugs that should be fixed (there have been couple of them but it has been very quite on this front long time, months or even half year already; but I might have broken something with the early Dec changes). These problem may originate from a bug that occurred a number of ACKs earlier the WARN_ON triggered, therefore they are a bit tricky to track, those WARN_ON serve just for alerting purposes and usually do not point out where the bug actually occurred. I usually just asked people to include exhaustive verifier which compares -sacked bitmaps with sacked/lost_out counters and report immediately when the problem shows up, rather than waiting for the cheaper S+L check we do in the WARN_ON to trigger. I tried to collect tracking patch from the previous efforts (hopefully got it right after modifications). I'm a bit worried about its reproducability if it takes this far to see it... It's trigged again in my pc, just while using firefox. ...Good, then there's some chance to catch it. -- i. [PATCH] [TCP]: debug S+L --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c |6 +++ net/ipv4/tcp_ipv4.c | 101 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 129 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern voidtcp_verify_wq(struct sock *sk); + extern voidtcp_v4_err(struct sk_buff *skb, u32); extern voidtcp_shutdown (struct sock *sk, int how); @@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) } /* Use define here intentionally to get WARN_ON location shown at the caller */ -#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp) tp-packets_out) +#define tcp_verify_left_out(tp)\ + do { \ + WARN_ON(tcp_left_out(tp) tp
Re: WARNING, tcp_fastretrans_alert, rc6-git11
On Tue, 22 Jan 2008, Denys Fedoryshchenko wrote: Just got on one of proxies, under high load. It is a bit old rc, so probably my report not interesting, but since it is production machines, i cannot change too often. Kernel is 2.6.24-rc6-git11 It's not at all useless, there hasn't been any TCP changes lately so this is still very much valid. Some sysctl adjustments done. Please tell me if need more information. TCP related sysctl setting would be nice to know (I don't care much about mem sizes but others tweaks affecting TCP features are nice to know). ...snip... [9561199.893090] WARNING: at net/ipv4/tcp_input.c:2391 tcp_fastretrans_alert() This check includes fixing if there's miscount (couple of releases ago we didn't even care to print that but had inaccurate fackets_out as a feature, there still may be some left-overs though I've tried to track them down). To find out cause more accurately (this WARN_ON is just for alerting), I'd need to add rather exhaustive searching per ACK which unlikely is acceptable for you. Luckily, Dave Young reported one problem lately (though he was running brand new mm) which could be caused by the same problem as this, so it's not that bad even though you couldn't run skb state verifying. Thanks anyway for the report. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, David Miller wrote: From: Dave Young [EMAIL PROTECTED] Date: Wed, 23 Jan 2008 09:44:30 +0800 On Jan 22, 2008 6:47 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote: [PATCH] [TCP]: debug S+L Thanks, If there's new findings I will let you know. Thanks for helping with this bug Dave. I noticed btw that there thing might (is likely to) spuriously trigger at WARN_ON(sacked != tp-sacked_out); because those won't be equal when SACK is not enabled. If that does happen too often, I send a fixed patch for it, yet, the fact that I print print tp-rx_opt.sack_ok allows identification of those cases already as it's zero when SACK is not enabled. Just ask if you need the updated debug patch. -- i.
Re: [PATCH 2/4] dsmark: get rid of trivial function
On Mon, 21 Jan 2008, Patrick McHardy wrote: Stephen Hemminger wrote: Replace loop in dsmark_valid_indices with equivalent bit math. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- a/net/sched/sch_dsmark.c2008-01-20 13:07:58.0 -0800 +++ b/net/sched/sch_dsmark.c2008-01-20 13:22:54.0 -0800 @@ -45,13 +45,8 @@ struct dsmark_qdisc_data { static inline int dsmark_valid_indices(u16 indices) { - while (indices != 1) { - if (indices 1) - return 0; - indices = 1; - } - - return 1; + /* Must have only one bit set */ + return (indices (indices - 1)) == 0; Isn't there some magic under include/linux to do that btw, I suppose that if the caller side zero check is pushed down there too, the is_power_of_2() is 100% match? :-) hweight seems easier to understand, it took me a bit to realize that the comment matches the code :) In addition, the original seems infinite loop with zero indices given but luckily that was checked at the caller site already... -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Mon, 21 Jan 2008, Dave Young wrote: Please see the kernel messages following,(trigged while using some qemu session) BTW, seems there's some e100 error message as well. PCI: Setting latency timer of device :00:1b.0 to 64 e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI e100: Copyright(c) 1999-2006 Intel Corporation ACPI: PCI Interrupt :03:08.0[A] - GSI 20 (level, low) - IRQ 20 modprobe:2331 conflicting cache attribute efaff000-efb0 uncached-default e100: :03:08.0: e100_probe: Cannot map device registers, aborting. ACPI: PCI interrupt for device :03:08.0 disabled e100: probe of :03:08.0 failed with error -12 eth0: setting full-duplex. [ cut here ] WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 [c0132100] ? printk+0x0/0x20 [c0131834] warn_on_slowpath+0x54/0x80 [c03e8df8] ? ip_finish_output+0x128/0x2e0 [c03e9527] ? ip_output+0xe7/0x100 [c03e8a88] ? ip_local_out+0x18/0x20 [c03e991c] ? ip_queue_xmit+0x3dc/0x470 [c043641e] ? _spin_unlock_irqrestore+0x5e/0x70 [c0186be1] ? check_pad_bytes+0x61/0x80 [c03f6031] tcp_mark_head_lost+0x121/0x150 [c03f60ac] tcp_update_scoreboard+0x4c/0x170 [c03f6e0a] tcp_fastretrans_alert+0x48a/0x6b0 [c03f7d93] tcp_ack+0x1b3/0x3a0 [c03fa14b] tcp_rcv_established+0x3eb/0x710 [c04015c5] tcp_v4_do_rcv+0xe5/0x100 [c0401bbb] tcp_v4_rcv+0x5db/0x660 Doh, once more these S+L things..., the rest are symptom of the first problem. What is strange is that it doesn't show up until now, the last TCP changes that could have some significance are from early Dec/Nov. Is there some reason why you haven't seen this before this (e.g., not tested with similar cfg or so)? I'm a bit worried about its reproducability if it takes this far to see it... -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sun, 13 Jan 2008, David Miller wrote: From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Mon, 14 Jan 2008 09:43:08 +0200 (EET) I'd prefer sending them as iso-8859-1 compliant (and I guess you are able to test your fix-to-utf-8 machinery with it as well :-)), as it would also make my mails compatible with other people's git apply tools you're not using (otherwise I'd probably forget to change it occassionally when interacting with others than you). For now either way is fine with me. If the situation changes I'll let you know. Ok, I'll remain in iso-8859-1, it's something that is known to work from my end. Thanks anyway for fixing it, wasn't any big deal for me at any point of time but people started asking me privately to correct it which I of course couldn't... :-) I'm surprised git-send-email can't get it purely utf8 correctly. The problem is that my system is ISO natively, so git-send-email might encode a ISO native .patch file's content while sending, which in this case was intentionally already utf-8. It might surprise you but it wasn't a long time ago when git-send-email wouldn't care less e.g. about header encoding and I got rejects from netdev due to my name which wasn't encoded properly, I've 1.5.0.6 currently and it seemed still fail to encode Cc addresses it adds from signed-offs unless I explicitly ask for it to not do that (I explicitly ask for especific, encoded, from header anyway because it was broken at some point of time and my sending template is copy-paste originating from that time). There was some recent fixes in the git's logs regarding that encoding, so I intend to check if a later g-s-e is more able and if it isn't I'll report it to git folks. I wonder if there is some issue with how it gets your name string for the commit author etc. I've had it working well since the encoding header got relatively recently added (wasn't available at early dawn of git era), before that it was just a mess locally. Funny enough, you were able to magically mangle my emails to utf-8'ed commits nicely back then so I got a fixed commit back after an RTT :-). I wonder if getting it into your global GIT config file in proper UTF8 encoding would fix things. Put something like this into ~/.gitconfig [user] name = Ilpo Järvinen email = [EMAIL PROTECTED] I have this. In addition I have this (required to make my local system consistent): [i18n] commitencoding = ISO-8859-1 The problem was just that the API (or better, ABI) between us wasn't properly working :-)). While Herbert was working as the replacement-Dave in November, correct commit entries were created, so git has been working fine (I guess he used git applying tools instead of handmade scripts and they handle email correclt based on it's encoding). I tried logOutputEncoding = utf-8 in the last patch sets I sent (now could again remove it) but git-send-email problem appeared with it because the system is ISO natively. The GIT maintainer is Finnish which makes this situation even more perplexing to me, you might want to discuss it with him :-) Junio? Never heard that a Finnish name... ;-) Perhaps git-send-email wasn't written by that Finnish guy... :-) ...Besides, that Finnish git aintainer doesn't have any funny characters in his name... ;-) Thanks anyway for the tips all, I think we have it now working and I can return to inlines and rexmit_skb_hint things other TCP stuff rather than this hinderance. I've some interesting results from net header inlines checks I ran overnight :-). -- i.
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
On Fri, 11 Jan 2008, Zhang, Yanmin wrote: On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: The regression is: 1)stoakley with 2 qual-core processors: 11%; 2)Tulsa with 4 dual-core(+hyperThread) processors:13%; I have new update on this issue and also cc to netdev maillist. Thank David Miller for pointing me the netdev maillist. The test command is: #sudo taskset -c 7 ./netserver #sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -r 1,1 As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's regression is between 16%~11%. I tried to use bisect to locate the bad patch between 2.6.22 and 2.6.23-rc1, but the bisected kernel wasn't stable and went crazy. TCP work between that is very much non-existing. Using git-reset's to select a nearby merge point instead of default commit where bisection lands might be help in case the bisected kernel breaks. Also, limiting bisection under a subsystem might reduce probability of brokeness (might at least be able to narrow it down quite a lot), e.g. git bisect start net/ -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
On Mon, 14 Jan 2008, Ilpo Järvinen wrote: On Fri, 11 Jan 2008, Zhang, Yanmin wrote: On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's regression is between 16%~11%. I tried to use bisect to locate the bad patch between 2.6.22 and 2.6.23-rc1, but the bisected kernel wasn't stable and went crazy. TCP work between that is very much non-existing. I _really_ meant 2.6.22 - 2.6.23-rc1, not 2.6.24-rc1 in case you had a typo there which is not that uncommon while typing kernel versions... :-) -- i.
Re: [PATCH 1/8] [TCP]: Uninline tcp_set_state
On Sat, 12 Jan 2008, Stephen Hemminger wrote: On Sat, 12 Jan 2008 11:40:10 +0200 Ilpo Järvinen [EMAIL PROTECTED] wrote: ...snip... built-in.o: 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 ...snip... include/net/tcp.h | 35 +-- net/ipv4/tcp.c| 35 +++ 2 files changed, 36 insertions(+), 34 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 48081ad..306580c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -926,40 +926,7 @@ static const char *statename[]={ Close Wait,Last ACK,Listen,Closing }; #endif - -static inline void tcp_set_state(struct sock *sk, int state) -{ - int oldstate = sk-sk_state; - - switch (state) { - case TCP_ESTABLISHED: - if (oldstate != TCP_ESTABLISHED) - TCP_INC_STATS(TCP_MIB_CURRESTAB); - break; - - case TCP_CLOSE: - if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED) - TCP_INC_STATS(TCP_MIB_ESTABRESETS); - - sk-sk_prot-unhash(sk); - if (inet_csk(sk)-icsk_bind_hash - !(sk-sk_userlocks SOCK_BINDPORT_LOCK)) - inet_put_port(tcp_hashinfo, sk); - /* fall through */ - default: - if (oldstate==TCP_ESTABLISHED) - TCP_DEC_STATS(TCP_MIB_CURRESTAB); - } - - /* Change state AFTER socket is unhashed to avoid closed -* socket sitting in hash tables. -*/ - sk-sk_state = state; - -#ifdef STATE_TRACE - SOCK_DEBUG(sk, TCP sk=%p, State %s - %s\n,sk, statename[oldstate],statename[state]); -#endif -} Since the function is called with a constant state, I guess the assumption was that gcc would be smart enough to only include the code needed, it looks like either code was bigger or the compiler was dumber than expected I'd guess that compiler was just dumber... :-) It might be an interesting experiment to convert it to if's and see if it would make a difference, maybe it just gets confused by the switch or something. Besides, it not always that obvious that gcc is able to determine the constant state, considering e.g., the complexity in the cases with tcp_rcv_synsent_state_process, tcp_close_state, tcp_done. In such cases uninlining should be done and gcc is probably not able to mix both cases nicely for a single function? However, after looking a bit, I'm partially leaning towards the other option too: tcp_done| -145 tcp_disconnect | -141 ...These called for tcp_set_state just _once_, while this calls for it twice: tcp_fin | -86 ...Obviously the compiler was able to perform some reductions but 43 bytes per inlining seems still a bit high number. -- i.
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sat, 12 Jan 2008, David Miller wrote: From: Ilpo_Järvinen [EMAIL PROTECTED] Date: Sat, 12 Jan 2008 14:59:50 +0200 (EET) ...Maybe I just fall-back to changing my last name, it's the only full-proof solution... ;-) Don't do this! Otherwise I won't have a frequent test case to make sure my patch applying scripts are working properly. :- So which test case you prefer? :-) Is iso-8859-1 from+content ok? Or should I keep trying to live with mixed utf-8 which I didn't got even fully working last time because git-send-email is probably either too dumb or too intelligent (I'm not even sure which), but you were able correct it by your tools so the flawed signed-off never entered to the git logs as incorrectly formatted :-). I'd prefer sending them as iso-8859-1 compliant (and I guess you are able to test your fix-to-utf-8 machinery with it as well :-)), as it would also make my mails compatible with other people's git apply tools you're not using (otherwise I'd probably forget to change it occassionally when interacting with others than you). -- i.
[PATCH net-2.6.25 0/8] [NET]: More uninlining related
Hi Dave, First of all, I changed output encoding of git to utf-8, so I guess the encoding should not cause the same trouble for you. Here are couple of more to uninline things. Pretty straightforward except the EXPORT_SYMBOLs, I've no idea which is the right variant (feel free to fix them while applying :-)). Also pktgen uninlining is somewhat questionable because it's just a testing tool so feel free to drop it if it feels unnecessary (I could have asked first but it's just as easy to do it this way if not easier)... There were more dead static inlines found after inlines removed (gcc didn't report them otherwise) than in pktgen now included, but I'm not sure if I should send by default patches removing or #if 0'ing them? -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 0/8] [NET]: More uninlining related
Hi Dave, First of all, I changed output encoding to utf-8, so I guess the encoding should not cause trouble for you. Here are couple of more to uninline things. Pretty straightforward except the EXPORT_SYMBOLs, I've no idea which is the right variant (feel free to fix them while applying :-)). Also pktgen uninlining is somewhat questionable because it's just a testing tool so feel free to drop it if it feels unnecessary (I could have asked first but it's just as easy to do it this way if not easier)... There were more dead static inlines found after inlines removed (gcc didn't report them otherwise) than in pktgen now included, but I'm not sure if I should send by default patches removing or #if 0'ing them? -- i. ps. I apologize that I must resend to get them to netdev as well because git-send-email of this system (not sure if later could) still seems to be lacking proper encoding of my name when it decides to add it to Cc list all by itself and those 8-bit chars in address got rejected. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] [TCP]: Uninline tcp_is_cwnd_limited
net/ipv4/tcp_cong.c: tcp_reno_cong_avoid | -65 1 function changed, 65 bytes removed, diff: -65 net/ipv4/arp.c: arp_ignore | -5 1 function changed, 5 bytes removed, diff: -5 net/ipv4/tcp_bic.c: bictcp_cong_avoid | -57 1 function changed, 57 bytes removed, diff: -57 net/ipv4/tcp_cubic.c: bictcp_cong_avoid | -61 1 function changed, 61 bytes removed, diff: -61 net/ipv4/tcp_highspeed.c: hstcp_cong_avoid | -63 1 function changed, 63 bytes removed, diff: -63 net/ipv4/tcp_hybla.c: hybla_cong_avoid | -85 1 function changed, 85 bytes removed, diff: -85 net/ipv4/tcp_htcp.c: htcp_cong_avoid | -57 1 function changed, 57 bytes removed, diff: -57 net/ipv4/tcp_veno.c: tcp_veno_cong_avoid | -52 1 function changed, 52 bytes removed, diff: -52 net/ipv4/tcp_scalable.c: tcp_scalable_cong_avoid | -61 1 function changed, 61 bytes removed, diff: -61 net/ipv4/tcp_yeah.c: tcp_yeah_cong_avoid | -75 1 function changed, 75 bytes removed, diff: -75 net/ipv4/tcp_illinois.c: tcp_illinois_cong_avoid | -54 1 function changed, 54 bytes removed, diff: -54 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | -7 ccid3_hc_tx_packet_recv| +7 2 functions changed, 7 bytes added, 7 bytes removed, diff: +0 net/ipv4/tcp_cong.c: tcp_is_cwnd_limited | +88 1 function changed, 88 bytes added, diff: +88 built-in.o: 14 functions changed, 95 bytes added, 642 bytes removed, diff: -547 ...Again some gcc artifacts visible as well. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/net/tcp.h | 22 +- net/ipv4/tcp_cong.c | 21 + 2 files changed, 22 insertions(+), 21 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 306580c..7de4ea3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -786,27 +786,7 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp) { return tp-snd_una + tp-snd_wnd; } - -/* RFC2861 Check whether we are limited by application or congestion window - * This is the inverse of cwnd check in tcp_tso_should_defer - */ -static inline int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight) -{ - const struct tcp_sock *tp = tcp_sk(sk); - u32 left; - - if (in_flight = tp-snd_cwnd) - return 1; - - if (!sk_can_gso(sk)) - return 0; - - left = tp-snd_cwnd - in_flight; - if (sysctl_tcp_tso_win_divisor) - return left * sysctl_tcp_tso_win_divisor tp-snd_cwnd; - else - return left = tcp_max_burst(tp); -} +extern int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight); static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss, const struct sk_buff *skb) diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index 4451750..3a6be23 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -274,6 +274,27 @@ int tcp_set_congestion_control(struct sock *sk, const char *name) return err; } +/* RFC2861 Check whether we are limited by application or congestion window + * This is the inverse of cwnd check in tcp_tso_should_defer + */ +int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight) +{ + const struct tcp_sock *tp = tcp_sk(sk); + u32 left; + + if (in_flight = tp-snd_cwnd) + return 1; + + if (!sk_can_gso(sk)) + return 0; + + left = tp-snd_cwnd - in_flight; + if (sysctl_tcp_tso_win_divisor) + return left * sysctl_tcp_tso_win_divisor tp-snd_cwnd; + else + return left = tcp_max_burst(tp); +} +EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited); /* * Slow start is used when congestion window is less than slow start -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
net/core/pktgen.c: pktgen_stop_device | -50 pktgen_run | -105 pktgen_if_show | -37 pktgen_thread_worker | -702 4 functions changed, 894 bytes removed, diff: -894 net/core/pktgen.c: getCurUs | +36 1 function changed, 36 bytes added, diff: +36 net/core/pktgen.o: 5 functions changed, 36 bytes added, 894 bytes removed, diff: -858 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/core/pktgen.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/core/pktgen.c b/net/core/pktgen.c index ebfb126..d18fdb1 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -405,7 +405,7 @@ static inline __u64 tv_to_us(const struct timeval *tv) return us; } -static inline __u64 getCurUs(void) +static __u64 getCurUs(void) { struct timeval tv; do_gettimeofday(tv); -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] [TCP]: Uninline tcp_set_state
net/ipv4/tcp.c: tcp_close_state | -226 tcp_done| -145 tcp_close | -564 tcp_disconnect | -141 4 functions changed, 1076 bytes removed, diff: -1076 net/ipv4/tcp_input.c: tcp_fin | -86 tcp_rcv_state_process | -164 2 functions changed, 250 bytes removed, diff: -250 net/ipv4/tcp_ipv4.c: tcp_v4_connect | -209 1 function changed, 209 bytes removed, diff: -209 net/ipv4/arp.c: arp_ignore | +5 1 function changed, 5 bytes added, diff: +5 net/ipv6/tcp_ipv6.c: tcp_v6_connect | -158 1 function changed, 158 bytes removed, diff: -158 net/sunrpc/xprtsock.c: xs_sendpages | -2 1 function changed, 2 bytes removed, diff: -2 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | +7 1 function changed, 7 bytes added, diff: +7 net/ipv4/tcp.c: tcp_set_state | +238 1 function changed, 238 bytes added, diff: +238 built-in.o: 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 I've no explanation why some unrelated changes seem to occur consistently as well (arp_ignore, ccid3_update_send_interval; I checked the arp_ignore asm and it seems to be due to some reordered of operation order causing some extra opcodes to be generated). Still, the benefits are pretty obvious from the codiff's results. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] Cc: Andi Kleen [EMAIL PROTECTED] --- include/net/tcp.h | 35 +-- net/ipv4/tcp.c| 35 +++ 2 files changed, 36 insertions(+), 34 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 48081ad..306580c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -926,40 +926,7 @@ static const char *statename[]={ Close Wait,Last ACK,Listen,Closing }; #endif - -static inline void tcp_set_state(struct sock *sk, int state) -{ - int oldstate = sk-sk_state; - - switch (state) { - case TCP_ESTABLISHED: - if (oldstate != TCP_ESTABLISHED) - TCP_INC_STATS(TCP_MIB_CURRESTAB); - break; - - case TCP_CLOSE: - if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED) - TCP_INC_STATS(TCP_MIB_ESTABRESETS); - - sk-sk_prot-unhash(sk); - if (inet_csk(sk)-icsk_bind_hash - !(sk-sk_userlocks SOCK_BINDPORT_LOCK)) - inet_put_port(tcp_hashinfo, sk); - /* fall through */ - default: - if (oldstate==TCP_ESTABLISHED) - TCP_DEC_STATS(TCP_MIB_CURRESTAB); - } - - /* Change state AFTER socket is unhashed to avoid closed -* socket sitting in hash tables. -*/ - sk-sk_state = state; - -#ifdef STATE_TRACE - SOCK_DEBUG(sk, TCP sk=%p, State %s - %s\n,sk, statename[oldstate],statename[state]); -#endif -} +extern void tcp_set_state(struct sock *sk, int state); extern void tcp_done(struct sock *sk); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 34085e3..7d7b958 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1652,6 +1652,41 @@ recv_urg: goto out; } +void tcp_set_state(struct sock *sk, int state) +{ + int oldstate = sk-sk_state; + + switch (state) { + case TCP_ESTABLISHED: + if (oldstate != TCP_ESTABLISHED) + TCP_INC_STATS(TCP_MIB_CURRESTAB); + break; + + case TCP_CLOSE: + if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED) + TCP_INC_STATS(TCP_MIB_ESTABRESETS); + + sk-sk_prot-unhash(sk); + if (inet_csk(sk)-icsk_bind_hash + !(sk-sk_userlocks SOCK_BINDPORT_LOCK)) + inet_put_port(tcp_hashinfo, sk); + /* fall through */ + default: + if (oldstate==TCP_ESTABLISHED) + TCP_DEC_STATS(TCP_MIB_CURRESTAB); + } + + /* Change state AFTER socket is unhashed to avoid closed +* socket sitting in hash tables. +*/ + sk-sk_state = state; + +#ifdef STATE_TRACE + SOCK_DEBUG(sk, TCP sk=%p, State %s - %s\n,sk, statename[oldstate],statename[state]); +#endif +} +EXPORT_SYMBOL_GPL(tcp_set_state); + /* * State processing on a close. This implements the state shift for * sending our FIN frame. Note that we only send a FIN for some -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] [IPV6] route: kill some bloat
net/ipv6/route.c: ip6_pkt_prohibit_out | -130 ip6_pkt_discard | -261 ip6_pkt_discard_out | -130 ip6_pkt_prohibit | -261 4 functions changed, 782 bytes removed, diff: -782 net/ipv6/route.c: ip6_pkt_drop | +300 1 function changed, 300 bytes added, diff: +300 net/ipv6/route.o: 5 functions changed, 300 bytes added, 782 bytes removed, diff: -482 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv6/route.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 02354a7..d3b5811 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1762,8 +1762,7 @@ int ipv6_route_ioctl(unsigned int cmd, void __user *arg) * Drop the packet on the floor */ -static inline int ip6_pkt_drop(struct sk_buff *skb, int code, - int ipstats_mib_noroutes) +static int ip6_pkt_drop(struct sk_buff *skb, int code, int ipstats_mib_noroutes) { int type; switch (ipstats_mib_noroutes) { -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/8] [NETLINK] af_netlink: kill some bloat
net/netlink/af_netlink.c: netlink_realloc_groups| -46 netlink_insert| -49 netlink_autobind | -94 netlink_clear_multicast_users | -48 netlink_bind | -55 netlink_setsockopt| -54 netlink_release | -86 netlink_kernel_create | -47 netlink_change_ngroups| -56 9 functions changed, 535 bytes removed, diff: -535 net/netlink/af_netlink.c: netlink_table_ungrab | +53 1 function changed, 53 bytes added, diff: +53 net/netlink/af_netlink.o: 10 functions changed, 53 bytes added, 535 bytes removed, diff: -482 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/netlink/af_netlink.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index be07f1b..21f9e30 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -193,7 +193,7 @@ static void netlink_table_grab(void) } } -static inline void netlink_table_ungrab(void) +static void netlink_table_ungrab(void) __releases(nl_table_lock) { write_unlock_irq(nl_table_lock); -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] [PKTGEN]: Kill dead static inlines
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/core/pktgen.c | 94 - 1 files changed, 0 insertions(+), 94 deletions(-) diff --git a/net/core/pktgen.c b/net/core/pktgen.c index ede1fea..ebfb126 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -397,62 +397,6 @@ struct pktgen_thread { #define REMOVE 1 #define FIND 0 -/* This code works around the fact that do_div cannot handle two 64-bit -numbers, and regular 64-bit division doesn't work on x86 kernels. ---Ben -*/ - -#define PG_DIV 0 - -/* This was emailed to LMKL by: Chris Caputo [EMAIL PROTECTED] - * Function copied/adapted/optimized from: - * - * nemesis.sourceforge.net/browse/lib/static/intmath/ix86/intmath.c.html - * - * Copyright 1994, University of Cambridge Computer Laboratory - * All Rights Reserved. - * - */ -static inline s64 divremdi3(s64 x, s64 y, int type) -{ - u64 a = (x 0) ? -x : x; - u64 b = (y 0) ? -y : y; - u64 res = 0, d = 1; - - if (b 0) { - while (b a) { - b = 1; - d = 1; - } - } - - do { - if (a = b) { - a -= b; - res += d; - } - b = 1; - d = 1; - } - while (d); - - if (PG_DIV == type) { - return (((x ^ y) (1ll 63)) == 0) ? res : -(s64) res; - } else { - return ((x (1ll 63)) == 0) ? a : -(s64) a; - } -} - -/* End of hacks to deal with 64-bit math on x86 */ - -/** Convert to milliseconds */ -static inline __u64 tv_to_ms(const struct timeval *tv) -{ - __u64 ms = tv-tv_usec / 1000; - ms += (__u64) tv-tv_sec * (__u64) 1000; - return ms; -} - /** Convert to micro-seconds */ static inline __u64 tv_to_us(const struct timeval *tv) { @@ -461,39 +405,6 @@ static inline __u64 tv_to_us(const struct timeval *tv) return us; } -static inline __u64 pg_div(__u64 n, __u32 base) -{ - __u64 tmp = n; - do_div(tmp, base); - /* printk(pktgen: pg_div, n: %llu base: %d rv: %llu\n, - n, base, tmp); */ - return tmp; -} - -static inline __u64 pg_div64(__u64 n, __u64 base) -{ - __u64 tmp = n; -/* - * How do we know if the architecture we are running on - * supports division with 64 bit base? - * - */ -#if defined(__sparc_v9__) || defined(__powerpc64__) || defined(__alpha__) || defined(__x86_64__) || defined(__ia64__) - - do_div(tmp, base); -#else - tmp = divremdi3(n, base, PG_DIV); -#endif - return tmp; -} - -static inline __u64 getCurMs(void) -{ - struct timeval tv; - do_gettimeofday(tv); - return tv_to_ms(tv); -} - static inline __u64 getCurUs(void) { struct timeval tv; @@ -501,11 +412,6 @@ static inline __u64 getCurUs(void) return tv_to_us(tv); } -static inline __u64 tv_diff(const struct timeval *a, const struct timeval *b) -{ - return tv_to_us(a) - tv_to_us(b); -} - /* old include end */ static char version[] __initdata = VERSION; -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] [NETFILTER] xt_policy.c: kill some bloat
net/netfilter/xt_policy.c: policy_mt | -906 1 function changed, 906 bytes removed, diff: -906 net/netfilter/xt_policy.c: match_xfrm_state | +427 1 function changed, 427 bytes added, diff: +427 net/netfilter/xt_policy.o: 2 functions changed, 427 bytes added, 906 bytes removed, diff: -479 Alternatively, this could be done by combining identical parts of the match_policy_in/out() Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/netfilter/xt_policy.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c index 46ee7e8..45731ca 100644 --- a/net/netfilter/xt_policy.c +++ b/net/netfilter/xt_policy.c @@ -33,7 +33,7 @@ xt_addr_cmp(const union xt_policy_addr *a1, const union xt_policy_addr *m, return false; } -static inline bool +static bool match_xfrm_state(const struct xfrm_state *x, const struct xt_policy_elem *e, unsigned short family) { -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] [XFRM] xfrm_policy: kill some bloat
net/xfrm/xfrm_policy.c: xfrm_audit_policy_delete | -692 xfrm_audit_policy_add| -692 2 functions changed, 1384 bytes removed, diff: -1384 net/xfrm/xfrm_policy.c: xfrm_audit_common_policyinfo | +704 1 function changed, 704 bytes added, diff: +704 net/xfrm/xfrm_policy.o: 3 functions changed, 704 bytes added, 1384 bytes removed, diff: -680 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/xfrm/xfrm_policy.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 534b29e..47219f9 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2367,8 +2367,8 @@ void __init xfrm_init(void) } #ifdef CONFIG_AUDITSYSCALL -static inline void xfrm_audit_common_policyinfo(struct xfrm_policy *xp, - struct audit_buffer *audit_buf) +static void xfrm_audit_common_policyinfo(struct xfrm_policy *xp, +struct audit_buffer *audit_buf) { struct xfrm_sec_ctx *ctx = xp-security; struct xfrm_selector *sel = xp-selector; -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] [HTB]: htb_classid is dead static inline
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/sched/sch_htb.c |4 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c index 72beb66..6a2352c 100644 --- a/net/sched/sch_htb.c +++ b/net/sched/sch_htb.c @@ -214,10 +214,6 @@ static inline struct htb_class *htb_find(u32 handle, struct Qdisc *sch) * then finish and return direct queue. */ #define HTB_DIRECT (struct htb_class*)-1 -static inline u32 htb_classid(struct htb_class *cl) -{ - return (cl cl != HTB_DIRECT) ? cl-classid : TC_H_UNSPEC; -} static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] [NET] core/utils.c: digit2bin is dead static inline
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/core/utils.c | 11 --- 1 files changed, 0 insertions(+), 11 deletions(-) diff --git a/net/core/utils.c b/net/core/utils.c index 34459c4..8031eb5 100644 --- a/net/core/utils.c +++ b/net/core/utils.c @@ -91,17 +91,6 @@ EXPORT_SYMBOL(in_aton); #define IN6PTON_NULL 0x2000 /* first/tail */ #define IN6PTON_UNKNOWN0x4000 -static inline int digit2bin(char c, int delim) -{ - if (c == delim || c == '\0') - return IN6PTON_DELIM; - if (c == '.') - return IN6PTON_DOT; - if (c = '0' c = '9') - return (IN6PTON_DIGIT | (c - '0')); - return IN6PTON_UNKNOWN; -} - static inline int xdigit2bin(char c, int delim) { if (c == delim || c == '\0') -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sat, 12 Jan 2008, Herbert Xu wrote: On Sat, Jan 12, 2008 at 09:40:17AM +, Ilpo Järvinen wrote: Your emails are now using UTF-8 encoding but it's still declaring ISO-8859-1 as the charset. Thanks for trying to help but my situation is such that I think it got also you confused (this kind of mixed encoding is beoynd my skills really)... :-) Besides, I wouldn't mind of having incorrect characters in my name, I'm just used to that but somebody else wasn't that happy about it. Here's one example... From: =?utf-8?q?Ilpo_J=C3=A4rvinen?= [EMAIL PROTECTED] ... Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Something still needed besides these to declare email utf-8? So you probably want to fix that up or your name may show up as Jävinen on the reader's screen. Did you actually see this? I'd expect to see that as well but no, From is correctly decoded by my ISO-8859-1'ish MUA, aha, seems that I still have something to do to deal with the Signed-off line. ...Maybe I just fall-back to changing my last name, it's the only full-proof solution... ;-) -- i.
TCP hints
Hi Stephen, Do you still remember what this is for (got added along with other TCP hint stuff)? What kind of problem you saw back then (or who saw problems)? @@ -1605,6 +1711,10 @@ static void tcp_undo_cwr(struct sock *sk, const int undo) } tcp_moderate_cwnd(tp); tp-snd_cwnd_stamp = tcp_time_stamp; + + /* There is something screwy going on with the retrans hints after + an undo */ + clear_all_retrans_hints(tp); } static inline int tcp_may_undo(struct tcp_sock *tp) -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Tue, 8 Jan 2008, Ilpo Järvinen wrote: On Mon, 7 Jan 2008, David Miller wrote: From: Andi Kleen [EMAIL PROTECTED] Date: Tue, 8 Jan 2008 06:00:07 +0100 On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote: The vast majority of them are one, two, and three liners. % awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ { len=line-start-3; if (len 4) l++; if (len = 10) k++; } ; END { print total, l, l/total, k, k/total }' include/net/tcp.h 68 28 0.411765 20 0.294118 41% are over 4 lines, 29% are = 10 lines. Take out the comments and whitespace lines, your script is too simplistic. In addition it triggered spuriously per struct/enum end brace :-) and was using the last known function starting brace in there so no wonder the numbers were that high... Counting with the corrected lines (len=line-start-1) spurious matches removed: 74 19 0.256757 7 0.0945946 Here are (finally) the measured bytes (couple of the functions are missing because I had couple of bugs in the regexps and the #if trickery at the inline resulted failed compiles): 12 funcs, 242+, 1697-, diff: -1455 tcp_set_state 13 funcs, 92+, 632-, diff: -540 tcp_is_cwnd_limited 12 funcs, 2836+, 3225-, diff: -389 tcp_current_ssthresh 5 funcs, 261+, 556-, diff: -295 tcp_prequeue 7 funcs, 2777+, 3049-, diff: -272 tcp_clear_retrans_hints_partial 11 funcs, 64+, 275-, diff: -211 tcp_win_from_space 6 funcs, 128+, 320-, diff: -192 tcp_prequeue_init 12 funcs, 45+, 209-, diff: -164 tcp_set_ca_state 7 funcs, 106+, 237-, diff: -131 tcp_fast_path_check 5 funcs, 167+, 291-, diff: -124 tcp_write_queue_purge 6 funcs, 43+, 160-, diff: -117 tcp_push_pending_frames 9 funcs, 55+, 159-, diff: -104 tcp_v4_check 6 funcs, 4+, 97-, diff: -93 tcp_packets_in_flight 7 funcs, 58+, 150-, diff: -92 tcp_fast_path_on 4 funcs, 4+, 91-, diff: -87 tcp_clear_options 6 funcs, 141+, 217-, diff: -76 tcp_openreq_init 8 funcs, 38+, 111-, diff: -73 tcp_unlink_write_queue 7 funcs, 32+, 103-, diff: -71 tcp_checksum_complete 7 funcs, 35+, 101-, diff: -66 __tcp_fast_path_on 5 funcs, 4+, 66-, diff: -62 tcp_receive_window 6 funcs, 67+, 128-, diff: -61 tcp_add_write_queue_tail 7 funcs, 30+, 86-, diff: -56tcp_ca_event 6 funcs, 73+, 106-, diff: -33 tcp_paws_check 4 funcs, 4+, 36-, diff: -32 tcp_highest_sack_seq 6 funcs, 46+, 78-, diff: -32tcp_fin_time 3 funcs, 4+, 35-, diff: -31 tcp_clear_all_retrans_hints 7 funcs, 30+, 51-, diff: -21__tcp_add_write_queue_tail 3 funcs, 4+, 14-, diff: -10 tcp_enable_fack 4 funcs, 4+, 14-, diff: -10 keepalive_time_when 8 funcs, 66+, 73-, diff: -7 tcp_full_space 3 funcs, 4+, 5-, diff: -1 tcp_wnd_end 4 funcs, 97+, 97-, diff: +0 tcp_mib_init 3 funcs, 4+, 3-, diff: +1 tcp_skb_is_last 2 funcs, 4+, 2-, diff: +2 keepalive_intvl_when 2 funcs, 4+, 2-, diff: +2 tcp_is_fack 2 funcs, 4+, 2-, diff: +2 tcp_skb_mss 2 funcs, 4+, 2-, diff: +2 tcp_write_queue_empty 2 funcs, 4+, 2-, diff: +2 tcp_advance_highest_sack 2 funcs, 4+, 2-, diff: +2 tcp_advance_send_head 2 funcs, 4+, 2-, diff: +2 tcp_check_send_head 2 funcs, 4+, 2-, diff: +2 tcp_highest_sack_reset 2 funcs, 4+, 2-, diff: +2 tcp_init_send_head 2 funcs, 4+, 2-, diff: +2 tcp_sack_reset 6 funcs, 47+, 44-, diff: +3 tcp_space 5 funcs, 55+, 50-, diff: +5 tcp_too_many_orphans 3 funcs, 8+, 2-, diff: +6 tcp_minshall_update 3 funcs, 8+, 2-, diff: +6 tcp_update_wl 8 funcs, 25+, 14-, diff: +11between 3 funcs, 14+, 2-, diff: +12 tcp_put_md5sig_pool 3 funcs, 14+, 2-, diff: +12 tcp_clear_xmit_timers 5 funcs, 30+, 17-, diff: +13tcp_dec_pcount_approx_int 6 funcs, 33+, 20-, diff: +13tcp_insert_write_queue_after 3 funcs, 17+, 2-, diff: +15 __tcp_checksum_complete 5 funcs, 17+, 2-, diff: +15 tcp_init_wl 4 funcs, 57+, 41-, diff: +16tcp_dec_quickack_mode 4 funcs, 40+, 22-, diff: +18__tcp_add_write_queue_head 5 funcs, 36+, 16-, diff: +20tcp_highest_sack_combine 4 funcs, 40+, 18-, diff: +22tcp_dec_pcount_approx 6 funcs, 29+, 5-, diff: +24 tcp_is_sack 4 funcs, 28+, 2-, diff: +26 tcp_is_reno 5 funcs, 50+, 24-, diff: +26tcp_insert_write_queue_before 4 funcs, 83+, 56-, diff: +27tcp_check_probe_timer 8 funcs, 69+, 14-, diff: +55tcp_left_out 11 funcs, 2995+, 2893-, diff: +102 tcp_skb_pcount 30 funcs, 930+, 2-, diff: +928 before -- i.
Re: [PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat
On Sat, 5 Jan 2008, Arnaldo Carvalho de Melo wrote: Em Sat, Jan 05, 2008 at 03:39:04PM +0200, Ilpo Järvinen escreveu: Hi Dave, After Arnaldo got codiff's inline instrumentation bugs fixed (thanks! :-)), I got my .c-inline-bloat-o-meter to power up reliably after some tweaking and bug fixing on my behalf... It shows some very high readings every now and then in the code under net/. Thank you for the reports and for showing how these tools can be put to good use! If you have any further suggestions on how to make codiff and the dwarves to be of more help or find any other bug, please let me know. It could use a bit less memory because my header inline checking attempt on a machine with 2G+2G ends up like this: $ codiff vmlinux.o.base vmlinux.o libclasses: out of memory(inline_expansion__new) $ ls -al vmlinux.o{,.base} -rw-r- 1 ijjarvin tkol 633132586 Jan 9 13:11 vmlinux.o -rw-r- 1 ijjarvin tkol 633132572 Jan 9 00:58 vmlinux.o.base $ Considering that's only 0.6G+0.6G I've a problem in understanding why codiff's number crunching eats up so much memory. I just hope there isn't any O(n^2) or worse algos in it either once the memory consumption gets resolved. :-) -- i.
Re: SACK scoreboard
On Tue, 8 Jan 2008, John Heffner wrote: Andi Kleen wrote: David Miller [EMAIL PROTECTED] writes: The big problem is that recovery from even a single packet loss in a window makes us run kfree_skb() for a all the packets in a full window's worth of data when recovery completes. Why exactly is it a problem to free them all at once? Are you worried about kernel preemption latencies? I also wonder how much of a problem this is (for now, with window sizes of order 1 packets. My understanding is that the biggest problems arise from O(N^2) time for recovery because every ack was expensive. Have current tests shown the final ack to be a major source of problems? This thread got started because I tried to solve the other latencies but realized that it helps very little because this latency spike would have remained unsolved and it happens in one of the most common case. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Mon, 7 Jan 2008, David Miller wrote: From: Andi Kleen [EMAIL PROTECTED] Date: Tue, 8 Jan 2008 06:00:07 +0100 On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote: The vast majority of them are one, two, and three liners. % awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ { len=line-start-3; if (len 4) l++; if (len = 10) k++; } ; END { print total, l, l/total, k, k/total }' include/net/tcp.h 68 28 0.411765 20 0.294118 41% are over 4 lines, 29% are = 10 lines. Take out the comments and whitespace lines, your script is too simplistic. He should also remove these, which involve just syntactic casting to please the compiler: struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); ...and maybe some other similar ones. I'm sure that's going to make his statistics even more questionable as is because we need tp all around. But about his concerns, I spend couple of minutes in looking to it by hand. These are likely to be a win: tcp_set_state tcp_prequeue tcp_prequeue_init tcp_openreq_init ...about rest I'm very unsure but there are some that probably are a minor win as well. tcp_dec_quickack_mode has only a single caller anyway so I'd doubt that it's going to be significant, I thought moving it to .c earlier but haven't just done that yet :-). -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Tue, 8 Jan 2008, Andi Kleen wrote: David Miller [EMAIL PROTECTED] writes: Similarly I question just about any inline usage at all in *.c files Don't forget the .h files. Especially a lot of stuff in tcp.h should be probably in some .c file and not be inline. I'm not forgetting them... :-) My intention is to do the same for all headers as well but I'd really like to get some actual number of bytes by measuring rather than taking them of the hat. ...It just requires some work still to get the uninlining actually do the right thing and actually testing it involves orders of magnitude more exhaustive recompilation than .o only checking required so after finishing the script I either has to setup n distcc master with a number of slaves each or wait for some time for the results :-). However, I suspect that anything in tcp.h won't match to something like ~5-10k like I've seen in three .c files already + one with debugging that has even more than 10k (even if all in tcp.h summed up :-)). -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SACK scoreboard
On Mon, 7 Jan 2008, David Miller wrote: Did you happen to read a recent blog posting of mine? http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead I've been thinking more and more and I think we might be able to get away with enforcing that SACKs are always increasing in coverage. I doubt there are any real systems out there that drop out of order packets that are properly formed and are in window, even though the SACK specification (foolishly, in my opinion) allows this. Luckily we can see that already from MIBs, so quering people who have large servers, which are continously testing the internet :-), under their supervision or can access, and asking if they see any might help. I checked my dept's interactive servers and all had zero renegings, but I don't think I have access to www server which would have much wider exposure. If we could free packets as SACK blocks cover them, all the problems go away. I thought it a bit yesterday after reading your blog and came to conclusion that they won't, we can still get those nasty ACKs regardless of received SACK info (in here, missing). Even in some valid cases which include ACK losses besides actual data loss, not that this is the most common case but just wanted to point out that cleanup work is at least partially independent of SACK problem. So not all problems would go away really. For one thing, this will allow the retransmit queue liberation during loss recovery to be spread out over the event, instead of batched up like crazy to the point where the cumulative ACK finally moves and releases an entire window's worth of data. Two key cases for real pattern are: 1. Losses once per n, where n is something small, like 2-20 or so, usually happens at slow start overshoot or when compething traffic slow starts. Cumulative ACKs will cover only small part of the window once rexmits make through, thus this is not a problem. 2. Single loss (or few at the beginning of the window), rest SACKed. Cumulative ACK will cover original window when the last necessary rexmit gets through. Case 1 becomes nasty ACKy only if rexmit is lost as well, but in that case the arriving SACK blocks make the rest of the window equal to 2 :-). So I'm now trying to solve just case 2. What if we could somehow combine adjacent skbs (or whatever they're called in that model) if SACK covers them both so that we still hold them but can drop them in a very efficient way. That would make the combining effort split per ACK. And if reneging would occur, we can think a way to put the necessary fuzz into a form which cannot hurt the rest of the system (relatively easy fast if we add CA_Reneging and allow retransmitting a portion of an skb similar to what you suggested earlier). And it might even be possible then to offer admin a control so that the admin can choose between recover/plain reset if admin thinks that it's always an indication of an attack. This is somewhat similar case to what UTO (under IETF evaluation) does, as purpose of both is in violation of RFC TCP to avoid malicious traps but the control about it is left to the user. Next, it would simplify all of this scanning code trying to figure out which holes to fill during recovery. And for SACK scoreboard marking, the RB trie would become very nearly unecessary as far as I can tell. I've been contacted by a person who was interested in reaching 500k windows, so your 4000 sounded like a joke :-/. Having, let say, every 20th dropped means 25k skbs remaining, can we scan though it in any sensible time without RBs and friends :-)? However, allowing queue walk to begin from either direction would solve most of the common cases well enough for it to be nearly manageable. I would not even entertain this kind of crazy idea unless I thought the fundamental complexity simplification payback was enormous. And in this case I think it is. What we could do is put some experimental hack in there for developers to start playing with, which would enforce that SACKs always increase in coverage. If violated the connection reset and a verbose log message is logged so we can analyze any cases that occur. We have an initial number already, in MIBs. Sounds crazy, but maybe has potential. What do you think? If I'd hint my boss that I'm involved in something like this I'd bet that he also would get quite crazy... ;-) I'm partially paid for making TCP more RFCish :-), or at least that the places where thing diverge are known and controllable for research purposes. -- i. ps. If other Cced would like to get dropped if there are some followups, just let me know :-). Else, no need to do anything. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Sun, 6 Jan 2008, Herbert Xu wrote: is definitely not a fast path. If a function ends up being called just once the compiler will most likely inline it anyway, making the use of the keyword inline redundant. Unexpected enough, even this logic seems to fail in a way with my gcc, I'm yet to study it closer but it seems to me that e.g., uninlining only once called tcp_fastretrans_alert is worth of at least 100 bytes (note that it's not inlined by us, gcc did it all by itself)! Otherwise I'd fail to understand why I got -270 bytes from uninlining tcp_moderate_cwnd which is only 57 bytes as unlined with three call sites. net/ipv4/tcp_input.c: tcp_undo_cwr | -48 tcp_try_undo_recovery | -55 tcp_ack | -2941 3 functions changed, 3044 bytes removed, diff: -3044 net/ipv4/tcp_input.c: tcp_moderate_cwnd | +57 tcp_fastretrans_alert | +2717 2 functions changed, 2774 bytes added, diff: +2774 net/ipv4/tcp_input.o: 5 functions changed, 2774 bytes added, 3044 bytes removed, diff: -270 I'll probably force uninlining of it without tcp_moderate_cwnd noise and try a number of gcc versions. -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Sat, 5 Jan 2008, David Miller wrote: From: Herbert Xu [EMAIL PROTECTED] Date: Sun, 06 Jan 2008 11:29:35 +1100 We should never use inline except when it's on the fast path and this is definitely not a fast path. If a function ends up being called just once the compiler will most likely inline it anyway, making the use of the keyword inline redundant. Similarly I question just about any inline usage at all in *.c files these days. I even would discourage it's use for fast-path cases as well. There still seems to be good candidates for inline in *.c files, in worst case I had +172 due to inline removed and ~60 are on +10-+90 range with my gcc, later gccs might do better but I definately would just blindly remove them all. Here's the other end of the list: +35 static inline hci_encrypt_change_evtnet/bluetooth/hci_event.c +36 static __inline__ tcp_in_window net/ipv4/tcp_minisocks.c +38 static inline hci_conn_complete_evt net/bluetooth/hci_event.c +38 static inline hci_conn_request_evt net/bluetooth/hci_event.c +42 static inline gred_wred_modenet/sched/sch_gred.c +45 static inline secpath_has_nontransport net/xfrm/xfrm_policy.c +52 static inline bool port_match net/netfilter/xt_tcpudp.c +67 static inline dn_check_idf net/decnet/dn_nsp_in.c +90 static inline __ieee80211_queue_stopped net/mac80211/tx.c +172 static inline sctp_chunk_length_valid net/sctp/sm_statefuns.c -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25] [NET]: Remove obsolete comment
[PATCH] [NET]: Remove obsolete comment It seems that ip_build_xmit is no longer used in here and ip_append_data is used. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- Hi Dave, While reading some nearby code, I noticed this. I'm not 100% sure if the removal is valid or not nor have interest to dig it up from history but I suppose you know it better without having to look it up from history :-). net/ipv4/ip_output.c |2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 3dc0c12..e57de0f 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1339,8 +1339,6 @@ static int ip_reply_glue_bits(void *dptr, char *to, int offset, * * Should run single threaded per socket because it uses the sock * structure to pass arguments. - * - * LATER: switch from ip_build_xmit to ip_append_* */ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *arg, unsigned int len) -- 1.5.0.6
[PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat
Hi Dave, After Arnaldo got codiff's inline instrumentation bugs fixed (thanks! :-)), I got my .c-inline-bloat-o-meter to power up reliably after some tweaking and bug fixing on my behalf... It shows some very high readings every now and then in the code under net/. ...Aand... we've a sovereign winner, though it was only fifth on a kernel wide list (arch/ excluded due to number of reasons) :-/. Hall of (unquestionable) fame (measured per inline, top 10 under net/): -4496 ctnetlink_parse_tuplenetfilter/nf_conntrack_netlink.c -2165 ctnetlink_dump_tuplesnetfilter/nf_conntrack_netlink.c -2115 __ip_vs_get_out_rt ipv4/ipvs/ip_vs_xmit.c -1924 xfrm_audit_helper_pktinfoxfrm/xfrm_state.c -1799 ctnetlink_parse_tuple_proto netfilter/nf_conntrack_netlink.c -1268 ctnetlink_parse_tuple_ip netfilter/nf_conntrack_netlink.c -1093 ctnetlink_exp_dump_expectnetfilter/nf_conntrack_netlink.c -1060 ccid3_update_send_interval dccp/ccids/ccid3.c -983 ctnetlink_dump_tuples_proto netfilter/nf_conntrack_netlink.c -827 ctnetlink_exp_dump_tuple netfilter/nf_conntrack_netlink.c Removing inlines is done iteratively because e.g., uninlining ctnetlink_parse_tuple affected ..._{proto,ip} prices as well. i386/gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)/allyesconfig except CONFIG_FORCED_INLINING. Tried without some CONFIG.*DEBUG as well and got slightly better numbers for some functions, yet the number don't differ enough to be that meaningful, ie., if there's bloat somewhere, removing DEBUGs won't make it go away. ...After this first aid we're at least below 1k :-). -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] [IPVS]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= [EMAIL PROTECTED] net/ipv4/ipvs/ip_vs_xmit.c: ip_vs_icmp_xmit | -638 ip_vs_tunnel_xmit | -674 ip_vs_nat_xmit| -716 ip_vs_dr_xmit | -682 4 functions changed, 2710 bytes removed, diff: -2710 net/ipv4/ipvs/ip_vs_xmit.c: __ip_vs_get_out_rt | +595 1 function changed, 595 bytes added, diff: +595 net/ipv4/ipvs/ip_vs_xmit.o: 5 functions changed, 595 bytes added, 2710 bytes removed, diff: -2115 Without some CONFIG.*DEBUGs: net/ipv4/ipvs/ip_vs_xmit.o: 5 functions changed, 383 bytes added, 1513 bytes removed, diff: -1130 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/ipvs/ip_vs_xmit.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/ipvs/ip_vs_xmit.c b/net/ipv4/ipvs/ip_vs_xmit.c index 1e96bf8..8436bf8 100644 --- a/net/ipv4/ipvs/ip_vs_xmit.c +++ b/net/ipv4/ipvs/ip_vs_xmit.c @@ -59,7 +59,7 @@ __ip_vs_dst_check(struct ip_vs_dest *dest, u32 rtos, u32 cookie) return dst; } -static inline struct rtable * +static struct rtable * __ip_vs_get_out_rt(struct ip_vs_conn *cp, u32 rtos) { struct rtable *rt; /* Route to the other host */ -- 1.5.0.6 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html