from:"Ilpo Järvinen"

Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows

2018-07-02 Thread Ilpo Järvinen

On Sat, 30 Jun 2018, Neal Cardwell wrote:

> As I mentioned, I ran your patch through all our team's TCP
> packetdrill tests, and it passes all of the tests. One of our tests
> needed updating, because if there is a non-SACK connection with a
> spurious RTO due to a delayed flight of ACKs then the FRTO undo now
> happens one ACK later (when we get an ACK that doesn't cover a
> retransmit). But that seems fine to me.

Yes, this is what is wanted. The non-SACK FRTO cannot make decision on 
the first cumulative ACK because that could be (often is) triggered by the 
retransmit but only from the next ACK after that.

Even with SACK FRTO, there is a hazard on doing it that early as tail ACK 
losses can lead to discovery of newly SACKed skbs from ACK of the
retransmitted segment. For that to occur, however, the cumulative ACK 
cannot cover those skbs implying more holes that need to be recovered. 
Therefore, the window reduction will eventually occur anyway but it would 
still first do a bogus undo also in that case.

> I also cooked the new packetdrill test below to explicitly cover this
> case you are addressing (please let me know if you have an alternate
> suggestion).
> 
> Tested-by: Neal Cardwell 
> Acked-by: Neal Cardwell 
> 
> Thanks!
> neal
> 
> ---
> 
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
> 
>+0 < S 0:0(0) win 32792 
>+0 > S. 0:0(0) ack 1 
>  +.02 < . 1:1(0) ack 1 win 257
>+0 accept(3, ..., ...) = 4
> 
> // Send 3 packets. First is really lost. And the dupacks
> // for the data packets that arrived at the reciver are slow in arriving.
>+0 write(4, ..., 3000) = 3000
>+0 > P. 1:3001(3000) ack 1
> 
> // RTO and retransmit head. This fills a real loss.
>  +.22 > . 1:1001(1000) ack 1
> 
> // Dupacks for packets 2 and 3 arrive.
> +.02  < . 1:1(0) ack 1 win 257
>+0 < . 1:1(0) ack 1 win 257
> 
> // The cumulative ACK for all the data arrives. We do not undo, because
> // this is a non-SACK connection, and retransmitted data was ACKed.
> // It's good that there's no FRTO undo, since a packet was really lost.
> // Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss
> // until something beyond high_seq is ACKed.
> +.005 < . 1:1(0) ack 3001 win 257
>+0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
>+0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }%
>+0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%

I think that the snd_cwnd is still fishy there but that would 
require also the other patch from my series (cwnd was 1 so it should be 2 
after the cumulative ACK).

-- 
 i.

Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows

2018-06-29 Thread Ilpo Järvinen

On Fri, 29 Jun 2018, Neal Cardwell wrote:

> On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen  
> wrote:
> >
> > If SACK is not enabled and the first cumulative ACK after the RTO
> > retransmission covers more than the retransmitted skb, a spurious
> > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > The reason is that any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case so the check for that bit won't help like it does with SACK).
> > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > in tcp_process_loss.
> >
> > We need to use more strict condition for non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted
> > to prove that progress is due to original transmissions. Only then
> > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > non-SACK case.
> >
> > (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
> > to better indicate its purpose but to keep this change minimal, it
> > will be done in another patch).
> >
> > Besides burstiness and congestion control violations, this problem
> > can result in RTO loop: When the loss recovery is prematurely
> > undoed, only new data will be transmitted (if available) and
> > the next retransmission can occur only after a new RTO which in case
> > of multiple losses (that are not for consecutive packets) requires
> > one RTO per loss to recover.
> >
> > Signed-off-by: Ilpo Järvinen 
> > ---
> >  net/ipv4/tcp_input.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 045d930..8e5522c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> >
> > if (tcp_is_reno(tp)) {
> > tcp_remove_reno_sacks(sk, pkts_acked);
> > +
> > +   /* If any of the cumulatively ACKed segments was
> > +* retransmitted, non-SACK case cannot confirm that
> > +* progress was due to original transmission due to
> > +* lack of TCPCB_SACKED_ACKED bits even if some of
> > +* the packets may have been never retransmitted.
> > +*/
> > +   if (flag & FLAG_RETRANS_DATA_ACKED)
> > +   flag &= ~FLAG_ORIG_SACK_ACKED;
> > } else {
> > int delta;
> 
> Thanks, Ilpo. I ran this through our TCP packetdrill tests and only
> got one failure, which detected the change you made, so that on the
> first ACK in a non-SACK F-RTO case we stay in CA_Loss.

Great, thanks for testing.

> However, depending on the exact scenario we are concerned about here,
> this may not be a complete solution. Consider the packetdrill test I
> cooked below, which is for a scenario where we imagine a non-SACK
> connection, with data packet #1 and all dupacks lost. With your patch
> we don't have an F-RTO undo on the first cumulative ACK, which is a
> definite improvement. However, we end up with an F-RTO undo on the
> *second* cumulative ACK, because the second ACK didn't cover any
> retransmitted data. That means that we end up in the second round trip
> back in the initial slow-start mode with a very high cwnd and infinite
> ssthresh, even though data was actually lost in the first round trip.
>
> I'm not sure how much we care about all these cases. Perhaps we should
> just disable F-RTO if the connection has no SACK enabled? These
> non-SACK connections are just such a rare case at this point that I
> would propose it's not worth spending too much time on this.
> 
> The following packetdrill script passes when Ilpo's patch is applied.
> This documents the behavior, which I think is questionable:
> 
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
> 
>+0 < S 0:0(0) win 32792 
>+0 > S. 0:0(0) ack 1 
>  +.02 < . 1:1(0) ack 1 win 257
>+0 accept(3, ..., ...) = 4
>+0 write(4, ..., 27000) = 27000
>+0 > . 1:10001(1) ack 1
> 
> // Suppose 1:1001 is lost and all dupacks are lost.
> 
> // RTO and

[PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows

2018-06-29 Thread Ilpo Järvinen

If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.

(FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
to better indicate its purpose but to keep this change minimal, it
will be done in another patch).

Besides burstiness and congestion control violations, this problem
can result in RTO loop: When the loss recovery is prematurely
undoed, only new data will be transmitted (if available) and
the next retransmission can occur only after a new RTO which in case
of multiple losses (that are not for consecutive packets) requires
one RTO per loss to recover.

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 045d930..8e5522c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
 
if (tcp_is_reno(tp)) {
tcp_remove_reno_sacks(sk, pkts_acked);
+
+   /* If any of the cumulatively ACKed segments was
+* retransmitted, non-SACK case cannot confirm that
+* progress was due to original transmission due to
+* lack of TCPCB_SACKED_ACKED bits even if some of
+* the packets may have been never retransmitted.
+*/
+   if (flag & FLAG_RETRANS_DATA_ACKED)
+   flag &= ~FLAG_ORIG_SACK_ACKED;
} else {
int delta;
 
-- 
2.7.4

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-04-04 Thread Ilpo Järvinen

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ych...@google.com> wrote:
> >
> > On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> > <ilpo.jarvi...@helsinki.fi> wrote:
> > > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> > >
> > >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> > >> <ilpo.jarvi...@helsinki.fi> wrote:
> > >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> > >> >
> > >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > >> >> <ilpo.jarvi...@helsinki.fi> wrote:
> > >> >> >
> > >> >> > A miscalculation for the number of acknowledged packets occurs 
> > >> >> > during
> > >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK 
> > >> >> > covers
> > >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> > >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start 
> > >> >> > after
> > >> >> > RTO as it may include segments that were not lost and therefore did
> > >> >> > not need retransmissions in the slow start following the RTO. Then
> > >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> > >> >> > triggering a burst.
> > >> >> >
> > >> >> > Instead, we want to pass only the number of retransmitted segments
> > >> >> > that were covered by the cumulative ACK (and potentially newly sent
> > >> >> > data segments too if the cumulative ACK covers that far).
> > >> >> >
> > >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> > >> >> > ---
> > >> >>
> > >> >> My understanding is there are two problems
> > >> >>
> > >> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> > >> >> precise cumulatively acked count, not newly acked count?
> > >> >
> > >> > While I'm not entirely sure if you intented to say that my fix is 
> > >> > broken
> > >> > or not, I thought this very difference alot while making the fix and I
> > >> > believe that this fix is needed because of the discontinuity at RTO
> > >> > (sacked_out is cleared as we set L-bits + lost_out). This is an 
> > >> > artifact
> > >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep 
> > >> > that
> > >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, 
> > >> > we
> > >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks 
> > >> > logic.
> > >> >
> > >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > >> > tp->delivered, using plain cumulative acked causes congestion control
> > >> > breakage later as call to tcp_cong_control will directly use the
> > >> > difference in tp->delivered.
> > >> >
> > >> > This boils down the exact definition of tp->delivered (the one given in
> > >> > the header is not detailed enough). I guess you might have better idea
> > >> > what it exactly is since one of you has added it? There are subtle 
> > >> > things
> > >> > in the defination that can make it entirely unsuitable for cc 
> > >> > decisions.
> > >> > Should those segments that we (possibly) already counted into
> > >> > tp->delivered during (potentially preceeding) CA_Recovery be added to 
> > >> > it
> > >> > for _second time_ or not? This fix avoids such double counting (the
> > >> Where is the double counting, assuming normal DUPACK behavior?
> > >>
> > >> In the non-sack case:
> > >>
> > >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> > >> incrementing tp->delivered in tcp_add_reno_sack()
> > >
> > > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > > I've tried to point out quite many times already)...
> > >
> > >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> > >> (high_seq), tp->delivered is incremented by (cumulati

Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-04-04 Thread Ilpo Järvinen

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> >
> > If SACK is not enabled and the first cumulative ACK after the RTO
> > retransmission covers more than the retransmitted skb, a spurious
> > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > The reason is that any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case so the check for that bit won't help like it does with SACK).
> > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > in tcp_process_loss.
> >
> > We need to use more strict condition for non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted
> > to prove that progress is due to original transmissions. Only then
> > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > non-SACK case.
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 4a26c09..c60745c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> > pkts_acked = rexmit_acked + newdata_acked;
> >
> > tcp_remove_reno_sacks(sk, pkts_acked);
> > +
> > +   /* If any of the cumulatively ACKed segments was
> > +* retransmitted, non-SACK case cannot confirm that
> > +* progress was due to original transmission due to
> > +* lack of TCPCB_SACKED_ACKED bits even if some of
> > +* the packets may have been never retransmitted.
> > +*/
> > +   if (flag & FLAG_RETRANS_DATA_ACKED)
> > +   flag &= ~FLAG_ORIG_SACK_ACKED;
> 
> How about keeping your excellent comment but move the fix to F-RTO
> code directly so it's more clear? this way the flag remains clear that
> indicates some never-retransmitted data are acked/sacked.
> 
> // pseudo code for illustration
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8d480542aa07..f7f3357de618 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2629,8 +2629,15 @@ static void tcp_process_loss(struct sock *sk,
> int flag, bool is_dupack,
> if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
> /* Step 3.b. A timeout is spurious if not all data are
>  * lost, i.e., never-retransmitted data are (s)acked.
> +*
> +* If any of the cumulatively ACKed segments was
> +* retransmitted, non-SACK case cannot confirm that
> +* progress was due to original transmission due to
> +* lack of TCPCB_SACKED_ACKED bits even if some of
> +* the packets may have been never retransmitted.
>  */
> if ((flag & FLAG_ORIG_SACK_ACKED) &&
> +   (tcp_is_sack(tp) || !FLAG_RETRANS_DATA_ACKED) &&
> tcp_try_undo_loss(sk, true))
> return;

Of course I could put the back there but I really like the new place more 
(which was a result of your suggestion to place the code elsewhere).
IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we 
weren't successful in proving (there in tcp_clean_rtx_queue) that progress 
was due original transmission and thus I would not want falsely indicate 
it with that flag. And there's the non-SACK related block anyway already 
there so it keeps the non-SACK "pollution" off from the SACK code paths.

(In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to 
FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition 
we're after regardless of SACK and less ambiguous in non-SACK case).

-- 
 i.

Re: [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-04-04 Thread Ilpo Järvinen

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> > When a cumulative ACK lands to high_seq at the end of loss
> > recovery and SACK is not enabled, the sender needs to avoid
> > false fast retransmits (RFC6582). The avoidance mechanisms is
> > implemented by remaining in the loss recovery CA state until
> > one additional cumulative ACK arrives. During the operation of
> > this avoidance mechanism, there is internal transient in the
> > use of state variables which will always trigger a bogus undo.
>
> Do we have to make undo in non-sack perfect? can we consider a much
> simpler but imperfect fix of
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8d480542aa07..95225d9de0af 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2356,6 +2356,7 @@ static bool tcp_try_undo_recovery(struct sock *sk)
>  * fast retransmits (RFC2582). SACK TCP is safe. */
> if (!tcp_any_retrans_done(sk))
> tp->retrans_stamp = 0;
> +   tp->undo_marker = 0;
> return true;
> }

Yes, that's of course a possible and would workaround the issue too. 
In fact, I initially did that kind of fix for myself (I put it into a 
block with tp->retrans_stamp = 0 though). But then I realized that it is 
not that complicated to make the fix locally into tcp_packet_delayed()
(except the annoyance of passing all the necessary state parameters
through the deep static call-chain but that should pose no big challenge 
for the compiler to handle I guess).

BTW, do you know under what circumstances that tcp_any_retrans_done(sk) 
would return non-zero here (snd_una == high_seq so those rexmit 
would need to be above high_seq)?

-- 
 i.

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-28 Thread Ilpo Järvinen

On Tue, 27 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> <ilpo.jarvi...@helsinki.fi> wrote:
> >> >
> >> > A miscalculation for the number of acknowledged packets occurs during
> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> > RTO as it may include segments that were not lost and therefore did
> >> > not need retransmissions in the slow start following the RTO. Then
> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> > triggering a burst.
> >> >
> >> > Instead, we want to pass only the number of retransmitted segments
> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> > data segments too if the cumulative ACK covers that far).
> >> >
> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> >> > ---
> >>
> >> My understanding is there are two problems
> >>
> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> >> precise cumulatively acked count, not newly acked count?
> >
> > While I'm not entirely sure if you intented to say that my fix is broken
> > or not, I thought this very difference alot while making the fix and I
> > believe that this fix is needed because of the discontinuity at RTO
> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> >
> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > tp->delivered, using plain cumulative acked causes congestion control
> > breakage later as call to tcp_cong_control will directly use the
> > difference in tp->delivered.
> >
> > This boils down the exact definition of tp->delivered (the one given in
> > the header is not detailed enough). I guess you might have better idea
> > what it exactly is since one of you has added it? There are subtle things
> > in the defination that can make it entirely unsuitable for cc decisions.
> > Should those segments that we (possibly) already counted into
> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> > for _second time_ or not? This fix avoids such double counting (the
> Where is the double counting, assuming normal DUPACK behavior?
> 
> In the non-sack case:
> 
> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> incrementing tp->delivered in tcp_add_reno_sack()

1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity 
I've tried to point out quite many times already)...

> 2. upon receiving a partial ACK or an ACK that acks recovery point
> (high_seq), tp->delivered is incremented by (cumulatively acked -
> #dupacks) in tcp_remove_reno_sacks()

...and this won't happen correctly anymore after RTO (since non-SACK 
won't keep #dupacks due to the discontinuity). Thus we end up adding
cumulatively acked - 0 to tp->delivered on those ACKs.

> therefore tp->delivered is tracking the # of packets delivered
> (sacked, acked, DUPACK'd) with the most information it could have
> inferred.

Since you didn't answer any of my questions about tp->delivered directly, 
let me rephrase them to this example (non-SACK, of course):

4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for 
three segments. How much should tp->delivered be incremented? 2 or 3?

...I think 2 is the right answer.

> From congestion control's perspective, it cares about the delivery
> information (e.g. how much), not the sequences (what or how).

I guess you must have missed my point. I'm talking about "how much" 
whole the time. It's just about when can we account for it (and when not).

> I am pointing out that
> 
> 1) your fix may fix one corner CC packet accounting issue in the
> non-SACK and CA_Loss
> 2) but it does not fix the other (major) CC packet accounting issue in
> the SACK case

You say major but it's major if and only if one is using one of the 
affected cc modules wh

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-27 Thread Ilpo Järvinen

On Mon, 26 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> >
> > A miscalculation for the number of acknowledged packets occurs during
> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > any non-retransmitted skbs. The reason is that pkts_acked value
> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > RTO as it may include segments that were not lost and therefore did
> > not need retransmissions in the slow start following the RTO. Then
> > tcp_slow_start will add the excess into cwnd bloating it and
> > triggering a burst.
> >
> > Instead, we want to pass only the number of retransmitted segments
> > that were covered by the cumulative ACK (and potentially newly sent
> > data segments too if the cumulative ACK covers that far).
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 16 +++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..4a26c09 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> > long seq_rtt_us = -1L;
> > long ca_rtt_us = -1L;
> > u32 pkts_acked = 0;
> > +   u32 rexmit_acked = 0;
> > +   u32 newdata_acked = 0;
> > u32 last_in_flight = 0;
> > bool rtt_update;
> > int flag = 0;
> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> > }
> >
> > if (unlikely(sacked & TCPCB_RETRANS)) {
> > -   if (sacked & TCPCB_SACKED_RETRANS)
> > +   if (sacked & TCPCB_SACKED_RETRANS) {
> > tp->retrans_out -= acked_pcount;
> > +   rexmit_acked += acked_pcount;
> > +   }
> > flag |= FLAG_RETRANS_DATA_ACKED;
> > } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> > last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> > reord = start_seq;
> > if (!after(scb->end_seq, tp->high_seq))
> > flag |= FLAG_ORIG_SACK_ACKED;
> > +   else
> > +   newdata_acked += acked_pcount;
> > }
> >
> > if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> > }
> >
> > if (tcp_is_reno(tp)) {
> > +   /* Due to discontinuity on RTO in the artificial
> > +* sacked_out calculations, TCP must restrict
> > +* pkts_acked without SACK to rexmits and new data
> > +* segments
> > +*/
> > +   if (icsk->icsk_ca_state == TCP_CA_Loss)
> > +   pkts_acked = rexmit_acked + newdata_acked;
> > +
> My understanding is there are two problems
> 
> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> precise cumulatively acked count, not newly acked count?

While I'm not entirely sure if you intented to say that my fix is broken 
or not, I thought this very difference alot while making the fix and I 
believe that this fix is needed because of the discontinuity at RTO 
(sacked_out is cleared as we set L-bits + lost_out). This is an artifact 
in the imitation of sacked_out for non-SACK but at RTO we can't keep that 
in sync because we set L-bits (and have no S-bits to guide us). Thus, we 
cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.

In tcp_remove_reno_sacks acked - sacked_out is being used to calculate 
tp->delivered, using plain cumulative acked causes congestion control 
breakage later as call to tcp_cong_control will directly use the 
difference in tp->delivered.

This boils down the exact definition of tp->delivered (the one given in 
the header is not detailed enough). I guess you might have better idea
what it exactly is since one of you has added it? There are subtle things 
in the defination that can make it entirely u

Re: [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP

2018-03-15 Thread Ilpo Järvinen

On Tue, 13 Mar 2018, Eric Dumazet wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> > Here is a series of fixes to issues that occur when SACK is not
> > enabled for a TCP connection. These are not fixes to just some
> > remote corner cases of recovery but many/almost all recoveries
> > without SACK will be impacted by one (or even more than one) of
> > them. The sender-side changes (1-4) are not mainly, if any, to
> > improve performance (throughput) but address correctness
> > (congestion control responses should not get incorrectly
> > reverted) and burstiness (that may cause additional problems
> > later as some of the packets in such bursts may get dropped
> > needing again to resort to loss recovery that is likely
> > similarly bursty).
> 
> GRO (or similar) adoption a long time ago (not only by linux) had a
> serious impact on non SACK flow.
> Should we also disable GRO by default ?
> (my answer is no, just in case someone wonders)
>
> I am sorry, but I am still not convinced by your patches, trying to
> give a wrong incentive to people keeping their prehistoric stacks,
> that have serious problems on wifi anyway, and or middle boxes
> "aggregating/compressing ACKS"

So now you think that I'm against high perf enhancements (even after 
having written some bits of them)?

However, I think that in case somebody disables something that is enabled 
by default, be it SACK, GRO, timestamps, etc., TCP he/she will get should 
still make sense (and that regardless of some middleboxes trying hard to 
break otherwise working stuff). But I guess you don't agree here(?) and 
would perhaps even want to remove ability to disable them? Although 
obviously that wouldn't even work with non-SACK (unless RST or so starts 
to get used but even that could be worked-around unfortunately).

Also, I'm somewhat confused your position here. On one hand you said that
tests should be added to regression tests to avoid breaking these non-SACK
cases again implying that things should be fixed and on the other hand you 
seem to say that non-SACK must be left broken?

> The last patch is particularly problematic in my opinion, you have not
> provided packetdrill tests to prove there was no danger.

Can that even be done? Proving absence of danger with packetdrill?
AFAIK that kind of proofs are available only with formal verification.

> It seems you leave to us the task of dealing with possible issues,
> only added a bold changelog.

Heh, I tried to add details to the changelog because you explicitly said
I should and now you fault me on doing that :-). Also, you're leaving out 
the detail that I tried to understand the condition that worried you and
found out that the code already disallows any shrinking of advertized 
window for duplicate ACKs (but I guess there just might have been some 
miscommunication between us given that your last reply to 5/5
v1 made no sense to me).

> Since the bugs you claim to fix are at least 10 years old, and your
> fixes are far from being trivial,
> please wait our packetdrill regression tests being published.
> This should happen in less than one month I believe, in time for linux-4.17
> 
> Note that the publication of the updated packetdrill and test suite is
> not an easy task,
> since we accumulated a lot of hacks both in kernel to cope with timer
> slacks and in packetdrill
> for various experimental or private features that are not yet in
> upstream kernels.
> 
> So we are cleaning all this, and will be happy to help you if you need.

I slightly disagree on the trivial bit here as I think it's trivial to 
see the changes only affect non-SACK flows (of which you seem to say you 
don't care if they're broken as that gives incentive to use SACK). But 
then I'm not too worried about waiting a month.

-- 
 i.

[PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-13 Thread Ilpo Järvinen

Currently, the TCP code is overly eager to increase window on
almost every ACK. It makes those ACKs that the receiver should
sent as dupACKs look like they update window that is not
considered a real dupACK by the non-SACK sender-side code.
Therefore the sender needs to resort to RTO to recover
losses as fast retransmit/fast recovery cannot be triggered
by such masked duplicate ACKs.

This change makes sure that when an ofo segment is received,
no change to window is applied if we are going to send a dupACK.
Even with this change, the window updates keep being maintained
but that occurs "behind the scenes". That is, this change does
not interfere with memory management of the flow which could
have long-term impact for the progress of the flow but only
prevents those updates being seen on the wire on short-term.
It's ok to change the window for non-dupACKs such as the first
ACK after ofo arrivals start if that ACK was using delayed ACKs
and also whenever the ack field advances. As ack field typically
advances once per RTT as the first hole is retransmitted, the
window updates are not delayed entirely during long recoveries.

Even before this fix, tcp_select_window did not allow ACK
shrinking the window for duplicate ACKs (that was previously
even called "treason" but the old charmy message is gone now).
The advertized window can only shrink when also ack field
changes which will not be blocked by this change as it is not
duplicate ACK.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 -
 net/ipv4/tcp_output.c | 43 +--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a 
cookie */
is_sack_reneg:1,/* in recovery from loss with SACK reneg? */
-   unused:2;
+   dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+   unused:1;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
unused1 : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 270aa48..4ff192b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4626,6 +4626,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, 
size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct inet_connection_sock *icsk = inet_csk(sk);
bool fragstolen;
int eaten;
 
@@ -4669,7 +4670,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
 * gap in queue is filled.
 */
if (RB_EMPTY_ROOT(>out_of_order_queue))
-   inet_csk(sk)->icsk_ack.pingpong = 0;
+   icsk->icsk_ack.pingpong = 0;
}
 
if (tp->rx_opt.num_sacks)
@@ -4719,6 +4720,8 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
goto queue_and_out;
}
 
+   if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+   tp->dupack_wnd_lock = 1;
tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
u32 old_win = tp->rcv_wnd;
-   u32 cur_win = tcp_receive_window(tp);
-   u32 new_win = __tcp_select_window(sk);
-
-   /* Never shrink the offered window */
-   if (new_win < cur_win) {
-   /* Danger Will Robinson!
-* Don't update rcv_wup/rcv_wnd here or else
-* we will not be able to advertise a zero
-* window in time.  --DaveM
-*
-* Relax Will Robinson.
-*/
-   if (new_win == 0)
-   NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPWANTZEROWINDOWADV);
-   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   u32 cur_win;
+   u32 new_win;
+
+   if (tp->dupack_wnd_lock) {
+   new_win = old_win;
+   tp->dupack_wnd_lock = 0;
+   } else {
+   cur_win = tcp_receive_window(tp);
+   new_win = __tcp_select_window(sk);
+   /* Never shrink the offered window */
+

[PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()

2018-03-13 Thread Ilpo Järvinen

No functional changes. This change simplifies the next change
slightly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c60745c..72ecfbb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int 
fast_rexmit)
}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp,
+   const u32 snd_una)
+{
+   return tcp_is_reno(tp) && (snd_una == tp->high_seq);
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
} else if (tp->rack.reo_wnd_persist) {
tp->rack.reo_wnd_persist--;
}
-   if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-   /* Hold old state until something *above* high_seq
-* is ACKed. For Reno it is MUST to prevent false
-* fast retransmits (RFC2582). SACK TCP is safe. */
+   if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) {
+   /* Hold old state until something *above* high_seq is ACKed
+* if false fast retransmit is possible.
+*/
if (!tcp_any_retrans_done(sk))
tp->retrans_stamp = 0;
return true;
-- 
2.7.4

[PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-13 Thread Ilpo Järvinen

A miscalculation for the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO as it may include segments that were not lost and therefore did
not need retransmissions in the slow start following the RTO. Then
tcp_slow_start will add the excess into cwnd bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..4a26c09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
long seq_rtt_us = -1L;
long ca_rtt_us = -1L;
u32 pkts_acked = 0;
+   u32 rexmit_acked = 0;
+   u32 newdata_acked = 0;
u32 last_in_flight = 0;
bool rtt_update;
int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (unlikely(sacked & TCPCB_RETRANS)) {
-   if (sacked & TCPCB_SACKED_RETRANS)
+   if (sacked & TCPCB_SACKED_RETRANS) {
tp->retrans_out -= acked_pcount;
+   rexmit_acked += acked_pcount;
+   }
flag |= FLAG_RETRANS_DATA_ACKED;
} else if (!(sacked & TCPCB_SACKED_ACKED)) {
last_ackt = skb->skb_mstamp;
@@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
reord = start_seq;
if (!after(scb->end_seq, tp->high_seq))
flag |= FLAG_ORIG_SACK_ACKED;
+   else
+   newdata_acked += acked_pcount;
}
 
if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (tcp_is_reno(tp)) {
+   /* Due to discontinuity on RTO in the artificial
+* sacked_out calculations, TCP must restrict
+* pkts_acked without SACK to rexmits and new data
+* segments
+*/
+   if (icsk->icsk_ca_state == TCP_CA_Loss)
+   pkts_acked = rexmit_acked + newdata_acked;
+
tcp_remove_reno_sacks(sk, pkts_acked);
} else {
int delta;
-- 
2.7.4

[PATCH v3 net 0/5] tcp: fixes to non-SACK TCP

2018-03-13 Thread Ilpo Järvinen

Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection. These are not fixes to just some
remote corner cases of recovery but many/almost all recoveries
without SACK will be impacted by one (or even more than one) of
them. The sender-side changes (1-4) are not mainly, if any, to
improve performance (throughput) but address correctness
(congestion control responses should not get incorrectly
reverted) and burstiness (that may cause additional problems
later as some of the packets in such bursts may get dropped
needing again to resort to loss recovery that is likely
similarly bursty).

v1 -> v2:
- Tried to improve changelogs
- Reworked FRTO undo fix location
- Removed extra parenthesis from EXPR (and while at it, reverse
  also the sides of &&)
- Pass prior_snd_una rather than flag around to avoid moving
  tcp_packet_delayed call
- Pass tp instead of sk. Sk was there only due to a subsequent
  change (that I think is only net-next material) limiting the
  use of the transient state to only RTO recoveries as it won't
  be needed after NewReno recovery that won't do unnecessary
  rexmits like the non-SACK RTO recovery does

v2 -> v3:
- Remove unnecessarily added braces

tcp: feed correct number of pkts acked to cc
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized

[PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-13 Thread Ilpo Järvinen

When a cumulative ACK lands to high_seq at the end of loss
recovery and SACK is not enabled, the sender needs to avoid
false fast retransmits (RFC6582). The avoidance mechanisms is
implemented by remaining in the loss recovery CA state until
one additional cumulative ACK arrives. During the operation of
this avoidance mechanism, there is internal transient in the
use of state variables which will always trigger a bogus undo.

When we enter to this transient state in tcp_try_undo_recovery,
tcp_any_retrans_done is often (always?) false resulting in
clearing retrans_stamp. On the next cumulative ACK,
tcp_try_undo_recovery again executes because CA state still
remains in the same recovery state and tcp_may_undo will always
return true because tcp_packet_delayed has this condition:
return !tp->retrans_stamp || ...

Check if the false fast retransmit transient avoidance is in
progress in tcp_packet_delayed to avoid bogus undos. Since snd_una
has advanced already on this ACK but CA state still remains
unchanged (CA state is updated slightly later than undo is
checked), prior_snd_una needs to be passed to tcp_packet_delayed
(instead of tp->snd_una). Passing prior_snd_una around to
the tcp_packet_delayed makes this change look more involved than
it really is.

The additional checks done in this change only affect non-SACK
case, the SACK case remains the same.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 42 ++
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 72ecfbb..270aa48 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct 
tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
+ const u32 prior_snd_una)
 {
-   return !tp->retrans_stamp ||
-  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+   if (!tp->retrans_stamp) {
+   /* Sender will be in a transient state with cleared
+* retrans_stamp during false fast retransmit prevention
+* mechanism
+*/
+   return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
+   }
+   return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, 
bool unmark_loss)
tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp,
+   const u32 prior_snd_una)
 {
-   return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+   return tp->undo_marker &&
+  (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_may_undo(tp)) {
+   if (tcp_may_undo(tp, prior_snd_una)) {
int mib_idx;
 
/* Happy end! We did not retransmit anything
@@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
+ bool frto_undo)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (frto_undo || tcp_may_undo(tp)) {
+   if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
tcp_undo_cwnd_reduction(sk, true);
 
DBGUNDO(sk, "partial loss");
@@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
 static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
-int *rexmit)
+int *rexmit, const u32 prior_snd_una)
 {
struct tcp_sock *tp = tcp_sk(sk);
bool recovered = !before(tp->snd_una, tp->high_seq);
 
if ((flag & FLAG_SND_UNA_ADVANCED) &&
-   tcp_try_undo_loss(sk, false))
+   tcp_try_undo_loss(sk, prior_snd_una, false))
return;
 
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk,

[PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-13 Thread Ilpo Järvinen

If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a26c09..c60745c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
pkts_acked = rexmit_acked + newdata_acked;
 
tcp_remove_reno_sacks(sk, pkts_acked);
+
+   /* If any of the cumulatively ACKed segments was
+* retransmitted, non-SACK case cannot confirm that
+* progress was due to original transmission due to
+* lack of TCPCB_SACKED_ACKED bits even if some of
+* the packets may have been never retransmitted.
+*/
+   if (flag & FLAG_RETRANS_DATA_ACKED)
+   flag &= ~FLAG_ORIG_SACK_ACKED;
} else {
int delta;
 
-- 
2.7.4

Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-13 Thread Ilpo Järvinen

On Fri, 9 Mar 2018, David Miller wrote:

> From: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)
> 
> > Unfortunately I don't have now permission to publish the time-seq
> > graph about it but I've tried to improve the changelog messages so
> > that you can better understand under which conditions the problem
> > occurs.
> 
> It is indeed extremely unfortunate that you wish to justify a change
> for which you cannot provide the supporting data at all.

Here is the time-seqno graph about the issue:

https://www.cs.helsinki.fi/u/ijjarvin/linux/nonsackbugs/recovery_undo_bug.pdf

First the correct CC action (wnd reduction) occurs; then bogus undo 
causes bursting back to the window with which the congestion losses 
occurred earlier; because of the burst, some packets get lost due to 
congestion again.

The sender is actually somewhat lucky here: If only one packet would get 
lost instead of three, the same process would repeat for the next recovery 
(as cumulative ACK to high_seq condition would reoccur).

-- 
 i.

Re: [PATCH v2 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-09 Thread Ilpo Järvinen

On Fri, 9 Mar 2018, Sergei Shtylyov wrote:

> On 03/09/2018 05:10 PM, Ilpo Järvinen wrote:
> 
> > @@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> > -   if (!after(scb->end_seq, tp->high_seq))
> > +   if (!after(scb->end_seq, tp->high_seq)) {
> > flag |= FLAG_ORIG_SACK_ACKED;
> > +   } else {
> > +   newdata_acked += acked_pcount;
> > +   }
> 
>Why add {} where they're not needed?

Not for a particular reason (I don't know even myself why I added them 
here as usually I don't add such extra braces).

I'll remove them, thanks.

-- 
 i.

Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-09 Thread Ilpo Järvinen

On Fri, 9 Mar 2018, David Miller wrote:

> From: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)
> 
> > Unfortunately I don't have now permission to publish the time-seq
> > graph about it but I've tried to improve the changelog messages so
> > that you can better understand under which conditions the problem
> > occurs.
> 
> It is indeed extremely unfortunate that you wish to justify a change
> for which you cannot provide the supporting data at all.

Well, the permission for time-seq graps was/is still simply just in 
pending state and then my patience on sending v2 out ran out :-).

Given that I perceived right from the start that the main purpose for
that "data" was to create a packetdrill test for that testsuite (that
is still in the soon-to-be-published state after all these years ;-)),
I didn't find significant benefit from a time-seq graph compared with
more verbose description that is now place in the changelog. In fact,
I think that time-seq graph well into the flow will be less useful than
the current explination in the changelog if one wants to come up a
packetdrill test but I'm no packetdrill expert.

-- 
 i.

Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-09 Thread Ilpo Järvinen

On Wed, 7 Mar 2018, Yuchung Cheng wrote:

> On Wed, Mar 7, 2018 at 12:19 PM, Neal Cardwell <ncardw...@google.com> wrote:
> >
> > On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> 
> > wrote:
> > > A bogus undo may/will trigger when the loss recovery state is
> > > kept until snd_una is above high_seq. If tcp_any_retrans_done
> > > is zero, retrans_stamp is cleared in this transient state. On
> > > the next ACK, tcp_try_undo_recovery again executes and
> > > tcp_may_undo will always return true because tcp_packet_delayed
> > > has this condition:
> > > return !tp->retrans_stamp || ...
> > >
> > > Check for the false fast retransmit transient condition in
> > > tcp_packet_delayed to avoid bogus undos. Since snd_una may have
> > > advanced on this ACK but CA state still remains unchanged,
> > > prior_snd_una needs to be passed instead of tp->snd_una.
> >
> > This one also seems like a case where it would be nice to have a
> > specific packet-by-packet example, or trace, or packetdrill scenario.
> > Something that we might be able to translate into a test, or at least
> > to document the issue more explicitly.
>
> I am hesitate for further logic to make undo "perfect" on non-sack
> cases b/c undo is very complicated and SACK is extremely
> well-supported today. so a trace to demonstrate how severe this issue
> is appreciated.

This is not just some remote corner cases to which I perhaps would 
understand your "making undo perfect" comment. Those undos result in
a burst that, at worst, triggers additional buffer overflow because the 
correct CC action is cancelled. Unfortunately I don't have now permission
to publish the time-seq graph about it but I've tried to improve the
changelog messages so that you can better understand under which
conditions the problem occurs.

SACK case remains the same even after this change. I did rework the
logic a bit though (pass prior_snd_una rather than flag around) to
make it more obvious but the change is unfortunately lengthy (no matter
what I pass through the call-chain).

-- 
 i.

[PATCH v2 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-09 Thread Ilpo Järvinen

A miscalculation for the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO as it may include segments that were not lost and therefore did
not need retransmissions in the slow start following the RTO. Then
tcp_slow_start will add the excess into cwnd bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..0305f6d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
long seq_rtt_us = -1L;
long ca_rtt_us = -1L;
u32 pkts_acked = 0;
+   u32 rexmit_acked = 0;
+   u32 newdata_acked = 0;
u32 last_in_flight = 0;
bool rtt_update;
int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (unlikely(sacked & TCPCB_RETRANS)) {
-   if (sacked & TCPCB_SACKED_RETRANS)
+   if (sacked & TCPCB_SACKED_RETRANS) {
tp->retrans_out -= acked_pcount;
+   rexmit_acked += acked_pcount;
+   }
flag |= FLAG_RETRANS_DATA_ACKED;
} else if (!(sacked & TCPCB_SACKED_ACKED)) {
last_ackt = skb->skb_mstamp;
@@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
if (before(start_seq, reord))
reord = start_seq;
-   if (!after(scb->end_seq, tp->high_seq))
+   if (!after(scb->end_seq, tp->high_seq)) {
flag |= FLAG_ORIG_SACK_ACKED;
+   } else {
+   newdata_acked += acked_pcount;
+   }
}
 
if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3158,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (tcp_is_reno(tp)) {
+   /* Due to discontinuity on RTO in the artificial
+* sacked_out calculations, TCP must restrict
+* pkts_acked without SACK to rexmits and new data
+* segments
+*/
+   if (icsk->icsk_ca_state == TCP_CA_Loss)
+   pkts_acked = rexmit_acked + newdata_acked;
+
tcp_remove_reno_sacks(sk, pkts_acked);
} else {
int delta;
-- 
2.7.4

[PATCH v2 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-09 Thread Ilpo Järvinen

Currently, the TCP code is overly eager to increase window on
almost every ACK. It makes those ACKs that the receiver should
sent as dupACKs look like they update window that is not
considered a real dupACK by the non-SACK sender-side code.
Therefore the sender needs to resort to RTO to recover
losses as fast retransmit/fast recovery cannot be triggered
by such masked duplicate ACKs.

This change makes sure that when an ofo segment is received,
no change to window is applied if we are going to send a dupACK.
Even with this change, the window updates keep being maintained
but that occurs "behind the scenes". That is, this change does
not interfere with memory management of the flow which could
have long-term impact for the progress of the flow but only
prevents those updates being seen on the wire on short-term.
It's ok to change the window for non-dupACKs such as the first
ACK after ofo arrivals start if that ACK was using delayed ACKs
and also whenever the ack field advances. As ack field typically
advances once per RTT as the first hole is retransmitted, the
window updates are not delayed entirely during long recoveries.

Even before this fix, tcp_select_window did not allow ACK
shrinking the window for duplicate ACKs (that was previously
even called "treason" but the old charmy message is gone now).
The advertized window can only shrink when also ack field
changes which will not be blocked by this change as it is not
duplicate ACK.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 -
 net/ipv4/tcp_output.c | 43 +--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a 
cookie */
is_sack_reneg:1,/* in recovery from loss with SACK reneg? */
-   unused:2;
+   dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+   unused:1;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
unused1 : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e67551a..ca49b62 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4627,6 +4627,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, 
size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct inet_connection_sock *icsk = inet_csk(sk);
bool fragstolen;
int eaten;
 
@@ -4670,7 +4671,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
 * gap in queue is filled.
 */
if (RB_EMPTY_ROOT(>out_of_order_queue))
-   inet_csk(sk)->icsk_ack.pingpong = 0;
+   icsk->icsk_ack.pingpong = 0;
}
 
if (tp->rx_opt.num_sacks)
@@ -4720,6 +4721,8 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
goto queue_and_out;
}
 
+   if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+   tp->dupack_wnd_lock = 1;
tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
u32 old_win = tp->rcv_wnd;
-   u32 cur_win = tcp_receive_window(tp);
-   u32 new_win = __tcp_select_window(sk);
-
-   /* Never shrink the offered window */
-   if (new_win < cur_win) {
-   /* Danger Will Robinson!
-* Don't update rcv_wup/rcv_wnd here or else
-* we will not be able to advertise a zero
-* window in time.  --DaveM
-*
-* Relax Will Robinson.
-*/
-   if (new_win == 0)
-   NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPWANTZEROWINDOWADV);
-   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   u32 cur_win;
+   u32 new_win;
+
+   if (tp->dupack_wnd_lock) {
+   new_win = old_win;
+   tp->dupack_wnd_lock = 0;
+   } else {
+   cur_win = tcp_receive_window(tp);
+   new_win = __tcp_select_window(sk);
+   /* Never shrink the offered window */
+

[PATCH v2 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-09 Thread Ilpo Järvinen

If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0305f6d..1277afb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3167,6 +3167,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
pkts_acked = rexmit_acked + newdata_acked;
 
tcp_remove_reno_sacks(sk, pkts_acked);
+
+   /* If any of the cumulatively ACKed segments was
+* retransmitted, non-SACK case cannot confirm that
+* progress was due to original transmission due to
+* lack of TCPCB_SACKED_ACKED bits even if some of
+* the packets may have been never retransmitted.
+*/
+   if (flag & FLAG_RETRANS_DATA_ACKED)
+   flag &= ~FLAG_ORIG_SACK_ACKED;
} else {
int delta;
 
-- 
2.7.4

[PATCH v2 net 0/5] tcp: fixes to non-SACK TCP

2018-03-09 Thread Ilpo Järvinen

Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection. These are not fixes to just some
remote corner cases of recovery but many/almost all recoveries
without SACK will be impacted by one (or even more than one) of
them. The bogus bursts that the sender side fixes are for may
cause additional problems later as some of the packets in such
bursts may get dropped needing again to resort to loss recovery
that is likely similarly bursty.

v1 -> v2:
- Tried to improve changelogs
- Reworked FRTO undo fix location
- Removed extra parenthesis from EXPR (and while at it, reverse
  also the sides of &&)
- Pass prior_snd_una rather than flag around to avoid moving
  tcp_packet_delayed call
- Pass tp instead of sk. sk was there only due to a subsequent
  change (that I think is only net-next material) limiting the
  use of the false FR avoidance to only RTO recoveries (as it
  won't be needed after NewReno recovery that won't do
  unnecessary rexmits like the non-SACK RTO recovery does)

tcp: feed correct number of pkts acked to cc
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized

[PATCH v2 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()

2018-03-09 Thread Ilpo Järvinen

No functional changes. This change simplifies the next change
slightly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1277afb..4ccba94 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int 
fast_rexmit)
}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp,
+   const u32 snd_una)
+{
+   return tcp_is_reno(tp) && (snd_una == tp->high_seq);
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
} else if (tp->rack.reo_wnd_persist) {
tp->rack.reo_wnd_persist--;
}
-   if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-   /* Hold old state until something *above* high_seq
-* is ACKed. For Reno it is MUST to prevent false
-* fast retransmits (RFC2582). SACK TCP is safe. */
+   if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) {
+   /* Hold old state until something *above* high_seq is ACKed
+* if false fast retransmit is possible.
+*/
if (!tcp_any_retrans_done(sk))
tp->retrans_stamp = 0;
return true;
-- 
2.7.4

[PATCH v2 net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-09 Thread Ilpo Järvinen

When a cumulative ACK lands to high_seq at the end of loss
recovery and SACK is not enabled, the sender needs to avoid
false fast retransmits (RFC6582). The avoidance mechanisms is
implemented by remaining in the loss recovery CA state until
one additional cumulative ACK arrives. During the operation of
this avoidance mechanism, there is internal transient in the
use of state variables which will always trigger a bogus undo.

When we enter to this transient state in tcp_try_undo_recovery,
tcp_any_retrans_done is often (always?) false resulting in
clearing retrans_stamp. On the next cumulative ACK,
tcp_try_undo_recovery again executes because CA state still
remains in the same recovery state and tcp_may_undo will always
return true because tcp_packet_delayed has this condition:
return !tp->retrans_stamp || ...

Check if the false fast retransmit transient avoidance is in
progress in tcp_packet_delayed to avoid bogus undos. Since snd_una
has advanced already on this ACK but CA state still remains
unchanged (CA state is updated slightly later than undo is
checked), prior_snd_una needs to be passed to tcp_packet_delayed
(instead of tp->snd_una). Passing prior_snd_una around to
the tcp_packet_delayed makes this change look more involved than
it really is.

The additional checks done in this change only affect non-SACK
case, the SACK case remains the same.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 42 ++
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4ccba94..e67551a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct 
tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
+ const u32 prior_snd_una)
 {
-   return !tp->retrans_stamp ||
-  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+   if (!tp->retrans_stamp) {
+   /* Sender will be in a transient state with cleared
+* retrans_stamp during false fast retransmit prevention
+* mechanism
+*/
+   return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
+   }
+   return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, 
bool unmark_loss)
tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp,
+   const u32 prior_snd_una)
 {
-   return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+   return tp->undo_marker &&
+  (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_may_undo(tp)) {
+   if (tcp_may_undo(tp, prior_snd_una)) {
int mib_idx;
 
/* Happy end! We did not retransmit anything
@@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
+ bool frto_undo)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (frto_undo || tcp_may_undo(tp)) {
+   if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
tcp_undo_cwnd_reduction(sk, true);
 
DBGUNDO(sk, "partial loss");
@@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
 static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
-int *rexmit)
+int *rexmit, const u32 prior_snd_una)
 {
struct tcp_sock *tp = tcp_sk(sk);
bool recovered = !before(tp->snd_una, tp->high_seq);
 
if ((flag & FLAG_SND_UNA_ADVANCED) &&
-   tcp_try_undo_loss(sk, false))
+   tcp_try_undo_loss(sk, prior_snd_una, false))
return;
 
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk,

Re: [PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-07 Thread Ilpo Järvinen

On Wed, 7 Mar 2018, Yuchung Cheng wrote:
> On Wed, Mar 7, 2018 at 11:24 AM, Neal Cardwell <ncardw...@google.com> wrote:
> > On Wed, Mar 7, 2018 at 7:59 AM, Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> 
> > wrote:
> > >
> > > In a non-SACK case, any non-retransmitted segment acknowledged will
> > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > > no indication that it would have been delivered for real (the
> > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > > case). This causes bogus undos in ordinary RTO recoveries where
> > > segments are lost here and there, with a few delivered segments in
> > > between losses. A cumulative ACKs will cover retransmitted ones at
> > > the bottom and the non-retransmitted ones following that causing
> > > FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and results
> > > in a spurious FRTO undo.
> > >
> > > We need to make the check more strict for non-SACK case and check
> > > that none of the cumulatively ACKed segments were retransmitted,
> > > which would be the case for the last step of FRTO algorithm as we
> > > sent out only new segments previously. Only then, allow FRTO undo
> > > to proceed in non-SACK case.
> >
> > Hi Ilpo - Do you have a packet trace or (even better) packetdrill
> > script illustrating this issue? It would be nice to have a test case
> > or at least concrete example of this.
>
> a packetdrill or even a contrived example would be good ...

I've seen all but this for sure in packet traces. But I'm somewhat 
old-school that while looking for the burst issue I discovered this 
issue by reading the code only (making it more than _one_ issue).
However, I think that I later on saw also this issue from the traces
(as it seemed to not match to any of the other burst issues this whole 
series is trying to fix). But finding that dump afterwards would take 
really long time, I've more than enough of them from our recent
tests ;-)).

But anyway, that was before the recent moving for the condition into 
tp->frto block so it might no longer be triggerable. It clearly was 
triggerable beforehand without tp->frto guard (and I just forward-ported 
past that recent change without thinking it much).

To trigger it, ever-R and !ever-R skb would need to be cumulatively 
ACKed when tp->frto is non-zero. Do you think that is still possible
with FRTO? E.g., after some undo leaving some ever-R and then RTO 
resulting in FRTO procedure?

> also why not just avoid setting FLAG_ORIG_SACK_ACKED on non-sack? seems 
> a much clean fix.

I guess that would work now that the relevant FRTO condition got moved
into the tp->frto block. It wouldn't have been that simple earlier
as SACK wanted FLAG_ORIG_SACK_ACKED while non-SACK wants
FLAG_ONLY_ORIG_ACKED (that was already available through a combination
of the existing FLAGs).

-- 
 i.

Re: [PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-07 Thread Ilpo Järvinen

On Wed, 7 Mar 2018, Eric Dumazet wrote:

> > Currently non-SACK senders cannot identify almost any duplicate ACKs 
> > because the window keeps updating for almost all ACKs. As a result, 
> > non-SACK senders end up doing loss recovery only with RTO. RTO
> > recovery 
> > without SACK is quite annoying because it potentially sends 
> > large number of unnecessary rexmits.
> 
> I get that, but switching from "always" to "never" sounds dangerous.
> 
> May I suggest you refine your patch, to maybe allow win reductions, in
> a logarithmic fashion maybe ?
> 
> This way, when you send 1000 DUPACK, only few of them would actually
> have to lower the window, and 99% of them would be considered as DUPACK
> by these prehistoric TCP stacks.

The problem I'm trying to fix is due to window increasing (of course, 
dupacks could also be masked because of decrease but that's something
I've never seen to occur in practice) so I tried to make you happy and 
change my fix to do "never increase, always decrease". However, it turns 
out that even pre-fixed code implements "never decrease" not "always 
decrease" like you thought:

static u16 tcp_select_window(struct sock *sk)
{
...
/* Never shrink the offered window */
if (new_win < cur_win) {
...
new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);

...Thus, I don't see my fix making the case you're worried about any
worse, AFAICT.

-- 
 i.

Re: [PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-07 Thread Ilpo Järvinen

Thanks for taking a look.

On Wed, 7 Mar 2018, Eric Dumazet wrote:
> On Wed, 2018-03-07 at 14:59 +0200, Ilpo Järvinen wrote:
> > Currently, the TCP code is overly eager to update window on
> > every ACK. It makes some of the ACKs that the receiver should
> > sent as dupACKs look like they update window update that are
> > not considered real dupACKs by the non-SACK sender-side code.
> > 
> > Make sure that when an ofo segment is received, no change to
> > window is applied if we are going to send a dupACK. It's ok
> > to change the window for non-dupACKs (such as the first ACK
> > after ofo arrivals start if that ACK was using delayed ACKs).
> 
> This looks dangerous to me.
> 
> We certainly want to lower the window at some point, or risk expensive
> pruning and/or drops.

I see you're conspiring for "treason" (if you recall those charmy
times) ;-).

> It is not clear by reading your changelog/patch, but it looks like some
> malicious peers could hurt us.

The malicious peers can certainly send out-of-window data already at will 
(with all of such packets being dropped regardless of that being 
expensive or not) so I don't see there's a big change for malicious case 
really. The window is only locked for what we've already promised for in 
an earlier ACK! ...Previously, reneging that promise (advertising smaller 
than the previous value) was called "treason" by us (unfortunately, the 
message is no longer as charmy).

Even with this change, the window is free to change when the ack field is 
updated which for legimate senders occurs typically once per RTT.

> By current standards, non TCP sack flows are not worth any potential
> issues.

Currently non-SACK senders cannot identify almost any duplicate ACKs 
because the window keeps updating for almost all ACKs. As a result, 
non-SACK senders end up doing loss recovery only with RTO. RTO recovery 
without SACK is quite annoying because it potentially sends 
large number of unnecessary rexmits.

-- 
 i.

[PATCH net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()

2018-03-07 Thread Ilpo Järvinen

No functional changes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a33752..e20f9ad 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,19 @@ static void tcp_update_scoreboard(struct sock *sk, int 
fast_rexmit)
}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct sock *sk,
+   const u32 snd_una)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return ((snd_una == tp->high_seq) && tcp_is_reno(tp));
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2363,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
} else if (tp->rack.reo_wnd_persist) {
tp->rack.reo_wnd_persist--;
}
-   if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-   /* Hold old state until something *above* high_seq
-* is ACKed. For Reno it is MUST to prevent false
-* fast retransmits (RFC2582). SACK TCP is safe. */
+   if (tcp_false_fast_retrans_possible(sk, tp->snd_una)) {
+   /* Hold old state until something *above* high_seq is ACKed
+* if false fast retransmit is possible.
+*/
if (!tcp_any_retrans_done(sk))
tp->retrans_stamp = 0;
return true;
-- 
2.7.4

[PATCH net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-07 Thread Ilpo Järvinen

The pkts_acked is clearly not the correct value for slow start
after RTO as it may include segments that were not lost and
therefore did not need retransmissions in the slow start following
the RTO. Then tcp_slow_start will add the excess into cwnd bloating
it and triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..0305f6d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
long seq_rtt_us = -1L;
long ca_rtt_us = -1L;
u32 pkts_acked = 0;
+   u32 rexmit_acked = 0;
+   u32 newdata_acked = 0;
u32 last_in_flight = 0;
bool rtt_update;
int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (unlikely(sacked & TCPCB_RETRANS)) {
-   if (sacked & TCPCB_SACKED_RETRANS)
+   if (sacked & TCPCB_SACKED_RETRANS) {
tp->retrans_out -= acked_pcount;
+   rexmit_acked += acked_pcount;
+   }
flag |= FLAG_RETRANS_DATA_ACKED;
} else if (!(sacked & TCPCB_SACKED_ACKED)) {
last_ackt = skb->skb_mstamp;
@@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
if (before(start_seq, reord))
reord = start_seq;
-   if (!after(scb->end_seq, tp->high_seq))
+   if (!after(scb->end_seq, tp->high_seq)) {
flag |= FLAG_ORIG_SACK_ACKED;
+   } else {
+   newdata_acked += acked_pcount;
+   }
}
 
if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3158,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (tcp_is_reno(tp)) {
+   /* Due to discontinuity on RTO in the artificial
+* sacked_out calculations, TCP must restrict
+* pkts_acked without SACK to rexmits and new data
+* segments
+*/
+   if (icsk->icsk_ca_state == TCP_CA_Loss)
+   pkts_acked = rexmit_acked + newdata_acked;
+
tcp_remove_reno_sacks(sk, pkts_acked);
} else {
int delta;
-- 
2.7.4

[PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-07 Thread Ilpo Järvinen

A bogus undo may/will trigger when the loss recovery state is
kept until snd_una is above high_seq. If tcp_any_retrans_done
is zero, retrans_stamp is cleared in this transient state. On
the next ACK, tcp_try_undo_recovery again executes and
tcp_may_undo will always return true because tcp_packet_delayed
has this condition:
return !tp->retrans_stamp || ...

Check for the false fast retransmit transient condition in
tcp_packet_delayed to avoid bogus undos. Since snd_una may have
advanced on this ACK but CA state still remains unchanged,
prior_snd_una needs to be passed instead of tp->snd_una.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 50 ++
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e20f9ad..b689915 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,6 +98,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_UPDATE_TS_RECENT  0x4000 /* tcp_replace_ts_recent() */
 #define FLAG_NO_CHALLENGE_ACK  0x8000 /* do not call tcp_send_challenge_ack()  
*/
 #define FLAG_ACK_MAYBE_DELAYED 0x1 /* Likely a delayed ACK */
+#define FLAG_PACKET_DELAYED0x2 /* 0 rexmits or tstamps reveal delayed 
pkt */
 
 #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP   (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -2243,10 +2244,19 @@ static bool tcp_skb_spurious_retrans(const struct 
tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct sock *sk,
+ const u32 prior_snd_una)
 {
-   return !tp->retrans_stamp ||
-  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   if (!tp->retrans_stamp) {
+   /* Sender will be in a transient state with cleared
+* retrans_stamp during false fast retransmit prevention
+* mechanism
+*/
+   return !tcp_false_fast_retrans_possible(sk, prior_snd_una);
+   }
+   return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2336,17 +2346,20 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, 
bool unmark_loss)
tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct sock *sk, const int flag)
 {
-   return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return tp->undo_marker &&
+  (!tp->undo_retrans || (flag & FLAG_PACKET_DELAYED));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const int flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_may_undo(tp)) {
+   if (tcp_may_undo(sk, flag)) {
int mib_idx;
 
/* Happy end! We did not retransmit anything
@@ -2393,11 +2406,11 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo, const int flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (frto_undo || tcp_may_undo(tp)) {
+   if (frto_undo || tcp_may_undo(sk, flag)) {
tcp_undo_cwnd_reduction(sk, true);
 
DBGUNDO(sk, "partial loss");
@@ -2636,7 +2649,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
bool recovered = !before(tp->snd_una, tp->high_seq);
 
if ((flag & FLAG_SND_UNA_ADVANCED) &&
-   tcp_try_undo_loss(sk, false))
+   tcp_try_undo_loss(sk, false, flag))
return;
 
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2649,7 +2662,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
 */
if ((flag & FLAG_ORIG_SACK_ACKED) &&
(tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
-   tcp_try_undo_loss(sk, true))
+   tcp_try_undo_loss(sk, true, flag))
return;
 
if (after(tp->snd_nxt, tp->high_seq)) {
@@ -2672,7 +2685,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
 
if (recovered) {
/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part

[PATCH net 0/5] tcp: fixes to non-SACK TCP

2018-03-07 Thread Ilpo Järvinen

Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection:

tcp: feed correct number of pkts acked to cc modules
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized window

[PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-07 Thread Ilpo Järvinen

Currently, the TCP code is overly eager to update window on
every ACK. It makes some of the ACKs that the receiver should
sent as dupACKs look like they update window update that are
not considered real dupACKs by the non-SACK sender-side code.

Make sure that when an ofo segment is received, no change to
window is applied if we are going to send a dupACK. It's ok
to change the window for non-dupACKs (such as the first ACK
after ofo arrivals start if that ACK was using delayed ACKs).

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 -
 net/ipv4/tcp_output.c | 43 +--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a 
cookie */
is_sack_reneg:1,/* in recovery from loss with SACK reneg? */
-   unused:2;
+   dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+   unused:1;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
unused1 : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b689915..77a289f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4633,6 +4633,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, 
size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct inet_connection_sock *icsk = inet_csk(sk);
bool fragstolen;
int eaten;
 
@@ -4676,7 +4677,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
 * gap in queue is filled.
 */
if (RB_EMPTY_ROOT(>out_of_order_queue))
-   inet_csk(sk)->icsk_ack.pingpong = 0;
+   icsk->icsk_ack.pingpong = 0;
}
 
if (tp->rx_opt.num_sacks)
@@ -4726,6 +4727,8 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
goto queue_and_out;
}
 
+   if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+   tp->dupack_wnd_lock = 1;
tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
u32 old_win = tp->rcv_wnd;
-   u32 cur_win = tcp_receive_window(tp);
-   u32 new_win = __tcp_select_window(sk);
-
-   /* Never shrink the offered window */
-   if (new_win < cur_win) {
-   /* Danger Will Robinson!
-* Don't update rcv_wup/rcv_wnd here or else
-* we will not be able to advertise a zero
-* window in time.  --DaveM
-*
-* Relax Will Robinson.
-*/
-   if (new_win == 0)
-   NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPWANTZEROWINDOWADV);
-   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   u32 cur_win;
+   u32 new_win;
+
+   if (tp->dupack_wnd_lock) {
+   new_win = old_win;
+   tp->dupack_wnd_lock = 0;
+   } else {
+   cur_win = tcp_receive_window(tp);
+   new_win = __tcp_select_window(sk);
+   /* Never shrink the offered window */
+   if (new_win < cur_win) {
+   /* Danger Will Robinson!
+* Don't update rcv_wup/rcv_wnd here or else
+* we will not be able to advertise a zero
+* window in time.  --DaveM
+*
+* Relax Will Robinson.
+*/
+   if (new_win == 0)
+   NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   }
+   tp->rcv_wnd = new_win;
+   tp->rcv_wup = tp->rcv_nxt;
}
-   tp->rcv_wnd = new_win;
-   tp->rcv_wup = tp->rcv_nxt;
 
/* Make sure we do not exceed the maximum possible
 * scaled window.
-- 
2.7.4

[PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-07 Thread Ilpo Järvinen

In a non-SACK case, any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case). This causes bogus undos in ordinary RTO recoveries where
segments are lost here and there, with a few delivered segments in
between losses. A cumulative ACKs will cover retransmitted ones at
the bottom and the non-retransmitted ones following that causing
FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and results
in a spurious FRTO undo.

We need to make the check more strict for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted,
which would be the case for the last step of FRTO algorithm as we
sent out only new segments previously. Only then, allow FRTO undo
to proceed in non-SACK case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
---
 net/ipv4/tcp_input.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0305f6d..1a33752 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2629,8 +2629,13 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
/* Step 3.b. A timeout is spurious if not all data are
 * lost, i.e., never-retransmitted data are (s)acked.
+*
+* As the non-SACK case does not keep track of 
TCPCB_SACKED_ACKED,
+* we need to ensure that none of the cumulative ACKed segments
+* was retransmitted to confirm the validity of 
FLAG_ORIG_SACK_ACKED.
 */
if ((flag & FLAG_ORIG_SACK_ACKED) &&
+   (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
tcp_try_undo_loss(sk, true))
return;
 
-- 
2.7.4

Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue

2016-09-08 Thread Ilpo Järvinen

) {
> - struct sk_buff *next = NULL;
>  
> - if (!skb_queue_is_last(>out_of_order_queue, skb))
> - next = skb_queue_next(>out_of_order_queue, skb);
> - skb = next;
> + for (head = skb;;) {
> + skb = tcp_skb_next(skb, NULL);
>  
> - /* Segment is terminated when we see gap or when
> -  * we are at the end of all the queue. */
> + /* Range is terminated when we see a gap or when
> +  * we are at the queue end.
> +  */
>   if (!skb ||
>   after(TCP_SKB_CB(skb)->seq, end) ||
>   before(TCP_SKB_CB(skb)->end_seq, start)) {
> - tcp_collapse(sk, >out_of_order_queue,
> + tcp_collapse(sk, NULL, >out_of_order_queue,
>head, skb, start, end);
> - head = skb;
> - if (!skb)
> - break;
> - /* Start new segment */
> + goto new_range;
> + }
> +
> + if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
>   start = TCP_SKB_CB(skb)->seq;
> + if (after(TCP_SKB_CB(skb)->end_seq, end))
>   end = TCP_SKB_CB(skb)->end_seq;
> - } else {
> - if (before(TCP_SKB_CB(skb)->seq, start))
> - start = TCP_SKB_CB(skb)->seq;
> - if (after(TCP_SKB_CB(skb)->end_seq, end))
> - end = TCP_SKB_CB(skb)->end_seq;
> - }
>   }
>  }

I tried long to think if I could propose alternative layout which would 
make this function to exit from the end but couldn't come up anything 
sensible. As is, it's always exiting within that top if block which is
somewhat unintuitive :-).

Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki>

-- 
 i.

Re: [RFC PATCHv2 net-next 1/2] tcp: RTO Restart (RTOR)

2015-12-08 Thread Ilpo Järvinen

On Tue, 8 Dec 2015, Per Hurtig wrote:

> This patch implements the RTO restart modification (RTOR). When data is
> ACKed, and the RTO timer is restarted, the time elapsed since the last
> outstanding segment was transmitted is subtracted from the calculated RTO
> value. This way, the RTO timer will expire after exactly RTO seconds, and
> not RTO + RTT [+ delACK] seconds.
> 
> This patch also implements a new sysctl (tcp_timer_restart) that is used
> to control the timer restart behavior.
> 
> Signed-off-by: Per Hurtig 
> ---
>  Documentation/networking/ip-sysctl.txt | 12 
>  include/net/tcp.h  |  6 ++
>  net/ipv4/sysctl_net_ipv4.c | 10 ++
>  net/ipv4/tcp_input.c   | 29 +
>  4 files changed, 57 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt 
> b/Documentation/networking/ip-sysctl.txt
> index 2ea4c45..4094128 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>   with the current initial RTO of 1second. With this the final timeout
>   for an active TCP connection attempt will happen after 127seconds.
>  
> +tcp_timer_restart - INTEGER
> + Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
> + If set (per timer or combined) the timers are restarted with
> + respect to the earliest outstanding segment, to not extend tail loss
> + latency unnecessarily.
> + Possible values:
> + 0 disables RTOR and TLPR.
> + 1 enables RTOR.
> + 2 enables TLPR.
> + 3 enables RTOR and TLPR.
> + Default: 3
> +
>  tcp_timestamps - BOOLEAN
>   Enable timestamps as defined in RFC1323.
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index f80e74c..833efb7 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -76,6 +76,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  /* After receiving this amount of duplicate ACKs fast retransmit starts. */
>  #define TCP_FASTRETRANS_THRESH 3
>  
> +/* Disable RTO Restart if the number of outstanding segments is at least. */
> +#define TCP_TIMER_RTORESTART 1
> +#define TCP_TIMER_TLPRESTART 2
> +#define TCP_RTORESTART_THRESH4

Unfortunately the comment got now separated from the actual define.


-- 
 i.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next 1/2] tcp: RTO Restart (RTOR)

2015-12-07 Thread Ilpo Järvinen

On Mon, 7 Dec 2015, Per Hurtig wrote:

> This patch implements the RTO restart modification (RTOR). When data is
> ACKed, and the RTO timer is restarted, the time elapsed since the last
> outstanding segment was transmitted is subtracted from the calculated RTO
> value. This way, the RTO timer will expire after exactly RTO seconds, and
> not RTO + RTT [+ delACK] seconds.
> 
> This patch also implements a new sysctl (tcp_timer_restart) that is used
> to control the timer restart behavior.
> 
> Signed-off-by: Per Hurtig 
> ---
>  Documentation/networking/ip-sysctl.txt | 12 
>  include/net/tcp.h  |  4 
>  net/ipv4/sysctl_net_ipv4.c | 10 ++
>  net/ipv4/tcp_input.c   | 24 
>  4 files changed, 50 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt 
> b/Documentation/networking/ip-sysctl.txt
> index 2ea4c45..4094128 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -591,6 +591,18 @@ tcp_syn_retries - INTEGER
>   with the current initial RTO of 1second. With this the final timeout
>   for an active TCP connection attempt will happen after 127seconds.
>  
> +tcp_timer_restart - INTEGER
> + Controls how the RTO and PTO timers are restarted (RTOR and TLPR).
> + If set (per timer or combined) the timers are restarted with
> + respect to the earliest outstanding segment, to not extend tail loss
> + latency unnecessarily.
> + Possible values:
> + 0 disables RTOR and TLPR.
> + 1 enables RTOR.
> + 2 enables TLPR.
> + 3 enables RTOR and TLPR.
> + Default: 3
> +
>  tcp_timestamps - BOOLEAN
>   Enable timestamps as defined in RFC1323.
>  

[...snip...]

> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index fdd88c3..66e0425 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c

[...snip...]

>  /* Restart timer after forward progress on connection.
>   * RFC2988 recommends to restart timer to now+rto.
>   */
> @@ -3027,6 +3040,17 @@ void tcp_rearm_rto(struct sock *sk)
>*/
>   if (delta > 0)
>   rto = delta;
> + } else if (icsk->icsk_pending == ICSK_TIME_RETRANS &&
> +(sysctl_tcp_timer_restart == 1 ||
> + sysctl_tcp_timer_restart == 3) &&

Use a bit operation here instead? Also I think that this sysctl would 
benefit from named constants rather than use of literals (similar 
comment applies to the other patch too).

> +(tp->packets_out + tcp_unsent_pkts(sk) <
> + TCP_RTORESTART_THRESH)) {
> + struct sk_buff *skb = tcp_write_queue_head(sk);
> + const u32 rto_time_stamp = tcp_skb_timestamp(skb);
> + s32 delta = (s32)(tcp_time_stamp - rto_time_stamp);
> +
> + if (delta > 0 && rto > delta)
> + rto -= delta;
>   }
>   inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
> TCP_RTO_MAX);


-- 
 i.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-23 Thread Ilpo Järvinen

On Sat, 23 Feb 2008, Andrew Morton wrote:

 On Wed, 20 Feb 2008 15:47:18 +0200 Ilpo Järvinen [EMAIL PROTECTED] wrote:
 
  vmlinux.o:
   62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869
  
  ...+ these to lib/jhash.o:
   jhash_3words: 112
   jhash2: 276
   jhash: 475
  
  select for networking code might need a more fine-grained approach.
 
 It should be possible to use a modular jhash.ko.  The things which you
 have identified as clients of the jhash library are usually loaded as modules.
 But in the case where someone does (say) NFSD=y we do need jhash.o linked into
 vmlinux also. This is doable in Kconfig but I always forget how.

Ok, even though its not that likely that one lives without e.g.
net/ipv4/inet_connection_sock.c or net/netlink/af_netlink.c? But maybe 
some guys really know what they are doing and can come up with config 
that would be able to build it as module (for other than proof-of-concept 
uses I mean)... :-/

 Adrian, Sam and Randy are the repositories of knowledge here ;)

Thanks, I'll consult them in this. I've never needed to do any Kconfig 
stuff so far so it's no surprise I have very little clue... :-)

I've one question for you Andrew, how would you like this kind of 
cross-subsys toucher to be merged, through you directly I suppose?


-- 
 i.

Re: [RFC PATCH 0/8]: uninline uninline

2008-02-23 Thread Ilpo Järvinen

On Sat, 23 Feb 2008, Andrew Morton wrote:

 On Wed, 20 Feb 2008 15:47:10 +0200 Ilpo J__rvinen [EMAIL PROTECTED] wrote:
 
  -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
 
 This is a surprise. 

It surprised me as well, there were something like 10 bytes I just 
couldn't explain in IS_ERR size (kernel/uninlined.c: IS_ERR | +25). I was 
to look into it deeper but didn't have the .o's at hand right away, not so 
trivial to store results of 5000+ build results except some carefully 
picked textual output :-)... Hmm, I'll add logging for the disassembly of 
the uninlined stuff into the next run, that won't cost too much...

 I expect that the -mm-only
 profile-likely-unlikely-macros.patch is the cause of this and mainline
 doesn't have this problem.

Ahaa, this explain it, I suspected that there was something (un)likely 
related elsewhere as well, no wonder why I couldn't reproduce the 25 bytes 
result in my quick copy-pasted non-kernel test...

 If true, then this likely/unlikely bloat has probably spread into a lot of
 your other results and it all should be redone against mainline, sorry :(

It isn't that troublesome to redo them, it's mainly automatic combined 
with impatient waiting from my behalf :-)... The spreading problem is 
probably true, to some extent. I did some runs also with carefully 
selected CONFIG.*DEBUG.* off under include/net/ earlier, in general it 
made very little difference, if something bloats, it usually does that 
regardless of .config. There are probably couple of exceptions when the 
size is on the boundary where it's very close of being useful to uninline 
it.

One interesting thing in there was that the largest offenders are quite 
small per call-site but due to vast number of them even a small benefit 
buys off a lot in kernel wide results. I suspect the differences due to 
(un)likely will be negligle, because the IS_ERR with all its trivialness 
is now mostly -15, so anything clearly larger than that will likely still 
be a win x n (where n is quite large).

Anyway, I'll see when I get the new set of tests running... :-) I'd prefer 
with CONFIG.*DEBUG off to get larger picture about non-debug builds too
(especially the scatterlist.h things interest me a lot), do you think that 
would do as well?

Thanks for your comments, I found them very useful.


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/8]: uninline uninline

2008-02-23 Thread Ilpo Järvinen

On Sat, 23 Feb 2008, Andi Kleen wrote:

 Andrew Morton [EMAIL PROTECTED] writes:
 
 
  -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
 
  This is a surprise.  I expect that the -mm-only
  profile-likely-unlikely-macros.patch is the cause of this and mainline
  doesn't have this problem.
 
 Shouldn't they only have overhead when the respective CONFIG is enabled?

I uninlined those with allyesconfig. I'll anyway try later on without a 
number of debug related CONFIGs.


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 0/8]: uninline uninline

2008-02-20 Thread Ilpo Järvinen

Hi all,

I run some lengthy tests to measure cost of inlines in headers under
include/, simple coverage calculations yields to 89% but most of the
failed compiles are due to preprocessor cutting the tested block away
anyway. Test setup: v2.6.24-mm1, make allyesconfig, 32-bit x86,
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Because one inline was
tested (function uninlined) at a time, the actual benefits of removing
multiple inlines may well be below what the sum of those individually
is (especially when something calls __-func with equal name).

Ok, here's the top of the list (1+ bytes):

-110805  869 f, 198 +, 111003 -, diff: -110805  skb_put 
-41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
-36290  42 f, 197 +, 36487 -, diff: -36290  cfi_build_cmd 
-35698  1234 f, 2391 +, 38089 -, diff: -35698  atomic_dec_and_test 
-28162  354 f, 3005 +, 31167 -, diff: -28162  skb_pull 
-23668  392 f, 104 +, 23772 -, diff: -23668  dev_alloc_skb 
-22212  415 f, 130 +, 22342 -, diff: -22212  __dev_alloc_skb 
-21593  356 f, 2418 +, 24011 -, diff: -21593  skb_push 
-19036  478 f, 259 +, 19295 -, diff: -19036  netif_wake_queue 
-18409  396 f, 6447 +, 24856 -, diff: -18409  __skb_pull 
-16420  187 f, 103 +, 16523 -, diff: -16420  dst_release 
-16025  13 f, 280 +, 16305 -, diff: -16025  cfi_send_gen_cmd 
-14925  486 f, 978 +, 15903 -, diff: -14925  add_timer 
-14896  199 f, 558 +, 15454 -, diff: -14896  sg_page 
-12870  36 f, 121 +, 12991 -, diff: -12870  le_key_k_type 
-12310  692 f, 7215 +, 19525 -, diff: -12310  signal_pending 
-11640  251 f, 118 +, 11758 -, diff: -11640  __skb_trim 
-11059  111 f, 293 +, 11352 -, diff: -11059  __nlmsg_put 
-10976  209 f, 123 +, 11099 -, diff: -10976  skb_trim 
-10344  125 f, 462 +, 10806 -, diff: -10344  pskb_may_pull 
-10061  300 f, 1163 +, 11224 -, diff: -10061  try_module_get 
-10008  75 f, 341 +, 10349 -, diff: -10008  nlmsg_put 

~250 are in 1000+ bytes category and ~440 in 500+. Full list
has some entries without number because given config doesn't
build them, and therefore nothing got uninlined, and the missing
entries consists solely of compile failures, available here:

  http://www.cs.helsinki.fi/u/ijjarvin/inlines/sorted

I made some patches to uninline couple of them (picked mostly
net related) to stir up some discussion, however, some of them
are not ready for inclusion as is (see patch descriptions).
The cases don't represent all top 8 cases because some of the
cases require a bit more analysis (e.g., config dependant,
comments about gcc optimizations).

The tools I used are available here except the site-specific
distribute machinery (in addition one needs pretty late
codiff from Arnaldo's toolset because there were some inline
related bugs fixed lately):

  http://www.cs.helsinki.fi/u/ijjarvin/inline-tools.git/

I'm planning to run similar tests also on inlines in headers that
are not under include/ but it requires minor modifications to
those tools.

--
 i.



--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 0/8]: uninline uninline

2008-02-20 Thread Ilpo Järvinen

Hi all,

I run some lengthy tests to measure cost of inlines in headers under
include/, simple coverage calculations yields to 89% but most of the
failed compiles are due to preprocessor cutting the tested block away
anyway. Test setup: v2.6.24-mm1, make allyesconfig, 32-bit x86,
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Because one inline was
tested (function uninlined) at a time, the actual benefits of removing
multiple inlines may well be below what the sum of those individually
is (especially when something calls __-func with equal name).

Ok, here's the top of the list (1+ bytes):

-110805  869 f, 198 +, 111003 -, diff: -110805  skb_put 
-41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
-36290  42 f, 197 +, 36487 -, diff: -36290  cfi_build_cmd 
-35698  1234 f, 2391 +, 38089 -, diff: -35698  atomic_dec_and_test 
-28162  354 f, 3005 +, 31167 -, diff: -28162  skb_pull 
-23668  392 f, 104 +, 23772 -, diff: -23668  dev_alloc_skb 
-22212  415 f, 130 +, 22342 -, diff: -22212  __dev_alloc_skb 
-21593  356 f, 2418 +, 24011 -, diff: -21593  skb_push 
-19036  478 f, 259 +, 19295 -, diff: -19036  netif_wake_queue 
-18409  396 f, 6447 +, 24856 -, diff: -18409  __skb_pull 
-16420  187 f, 103 +, 16523 -, diff: -16420  dst_release 
-16025  13 f, 280 +, 16305 -, diff: -16025  cfi_send_gen_cmd 
-14925  486 f, 978 +, 15903 -, diff: -14925  add_timer 
-14896  199 f, 558 +, 15454 -, diff: -14896  sg_page 
-12870  36 f, 121 +, 12991 -, diff: -12870  le_key_k_type 
-12310  692 f, 7215 +, 19525 -, diff: -12310  signal_pending 
-11640  251 f, 118 +, 11758 -, diff: -11640  __skb_trim 
-11059  111 f, 293 +, 11352 -, diff: -11059  __nlmsg_put 
-10976  209 f, 123 +, 11099 -, diff: -10976  skb_trim 
-10344  125 f, 462 +, 10806 -, diff: -10344  pskb_may_pull 
-10061  300 f, 1163 +, 11224 -, diff: -10061  try_module_get 
-10008  75 f, 341 +, 10349 -, diff: -10008  nlmsg_put 

~250 are in 1000+ bytes category and ~440 in 500+. Full list
has some entries without number because given config doesn't
build them, and therefore nothing got uninlined, and the missing
entries consists solely of compile failures, available here:

  http://www.cs.helsinki.fi/u/ijjarvin/inlines/sorted

I made some patches to uninline couple of them (picked mostly
net related) to stir up some discussion, however, some of them
are not ready for inclusion as is (see patch descriptions).
The cases don't represent all top 8 cases because some of the
cases require a bit more analysis (e.g., config dependant,
comments about gcc optimizations).

The tools I used are available here except the site-specific
distribute machinery (in addition one needs pretty late
codiff from Arnaldo's toolset because there were some inline
related bugs fixed lately):

  http://www.cs.helsinki.fi/u/ijjarvin/inline-tools.git/

I'm planning to run similar tests also on inlines in headers that
are not under include/ but it requires minor modifications to
those tools.

--
 i.

ps. I'm sorry about the duplicates, old git-send-email's
8-bit-header problem bit me again. :-(



--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen

~500 files changed
...
kernel/uninlined.c:
  skb_put   | +104
 1 function changed, 104 bytes added, diff: +104

vmlinux.o:
 869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805

This change is INCOMPLETE, I think that the call to current_text_addr()
should be rethought but I don't have a clue how to do that.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/skbuff.h |   20 +---
 net/core/skbuff.c  |   21 +
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 412672a..5925435 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -896,25 +896,7 @@ static inline unsigned char *__skb_put(struct sk_buff 
*skb, unsigned int len)
return tmp;
 }
 
-/**
- * skb_put - add data to a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer. If this would
- * exceed the total buffer size the kernel will panic. A pointer to the
- * first byte of the extra data is returned.
- */
-static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
-{
-   unsigned char *tmp = skb_tail_pointer(skb);
-   SKB_LINEAR_ASSERT(skb);
-   skb-tail += len;
-   skb-len  += len;
-   if (unlikely(skb-tail  skb-end))
-   skb_over_panic(skb, len, current_text_addr());
-   return tmp;
-}
+extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
 
 static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e35422..661d439 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -857,6 +857,27 @@ free_skb:
return err;
 }
 
+/**
+ * skb_put - add data to a buffer
+ * @skb: buffer to use
+ * @len: amount of data to add
+ *
+ * This function extends the used data area of the buffer. If this would
+ * exceed the total buffer size the kernel will panic. A pointer to the
+ * first byte of the extra data is returned.
+ */
+unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
+{
+   unsigned char *tmp = skb_tail_pointer(skb);
+   SKB_LINEAR_ASSERT(skb);
+   skb-tail += len;
+   skb-len  += len;
+   if (unlikely(skb-tail  skb-end))
+   skb_over_panic(skb, len, current_text_addr());
+   return tmp;
+}
+EXPORT_SYMBOL(skb_put);
+
 /* Trims skb to length len. It can change skb pointers.
  */
 
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 4/8] [NET]: uninline skb_push, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen

-21593  356 funcs, 2418 +, 24011 -, diff: -21593 --- skb_push

Again, current_text_addr() needs to addressed.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/skbuff.h |   18 +-
 net/core/skbuff.c  |   19 +++
 2 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df3cce2..c11f248 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -905,23 +905,7 @@ static inline unsigned char *__skb_push(struct sk_buff 
*skb, unsigned int len)
return skb-data;
 }
 
-/**
- * skb_push - add data to the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to add
- *
- * This function extends the used data area of the buffer at the buffer
- * start. If this would exceed the total buffer headroom the kernel will
- * panic. A pointer to the first byte of the extra data is returned.
- */
-static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
-{
-   skb-data -= len;
-   skb-len  += len;
-   if (unlikely(skb-dataskb-head))
-   skb_under_panic(skb, len, current_text_addr());
-   return skb-data;
-}
+extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
 
 static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 081bffb..05d43fd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -897,6 +897,25 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int 
len)
 EXPORT_SYMBOL(skb_put);
 
 /**
+ * skb_push - add data to the start of a buffer
+ * @skb: buffer to use
+ * @len: amount of data to add
+ *
+ * This function extends the used data area of the buffer at the buffer
+ * start. If this would exceed the total buffer headroom the kernel will
+ * panic. A pointer to the first byte of the extra data is returned.
+ */
+unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
+{
+   skb-data -= len;
+   skb-len  += len;
+   if (unlikely(skb-dataskb-head))
+   skb_under_panic(skb, len, current_text_addr());
+   return skb-data;
+}
+EXPORT_SYMBOL(skb_push);
+
+/**
  * skb_pull - remove data from the start of a buffer
  * @skb: buffer to use
  * @len: amount of data to remove
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen

-23668  392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/skbuff.h |   17 +
 net/core/skbuff.c  |   18 ++
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a9f8f15..df3cce2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1269,22 +1269,7 @@ static inline struct sk_buff *__dev_alloc_skb(unsigned 
int length,
return skb;
 }
 
-/**
- * dev_alloc_skb - allocate an skbuff for receiving
- * @length: length to allocate
- *
- * Allocate a new sk_buff and assign it a usage count of one. The
- * buffer has unspecified headroom built in. Users should allocate
- * the headroom they think they need without accounting for the
- * built in space. The built in space is used for optimisations.
- *
- * %NULL is returned if there is no free memory. Although this function
- * allocates memory it can be called from an interrupt.
- */
-static inline struct sk_buff *dev_alloc_skb(unsigned int length)
-{
-   return __dev_alloc_skb(length, GFP_ATOMIC);
-}
+extern struct sk_buff *dev_alloc_skb(unsigned int length);
 
 extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
unsigned int length, gfp_t gfp_mask);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 14f462b..081bffb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
return skb;
 }
 
+/**
+ * dev_alloc_skb - allocate an skbuff for receiving
+ * @length: length to allocate
+ *
+ * Allocate a new sk_buff and assign it a usage count of one. The
+ * buffer has unspecified headroom built in. Users should allocate
+ * the headroom they think they need without accounting for the
+ * built in space. The built in space is used for optimisations.
+ *
+ * %NULL is returned if there is no free memory. Although this function
+ * allocates memory it can be called from an interrupt.
+ */
+struct sk_buff *dev_alloc_skb(unsigned int length)
+{
+   return __dev_alloc_skb(length, GFP_ATOMIC);
+}
+EXPORT_SYMBOL(dev_alloc_skb);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
struct sk_buff *list = *listp;
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 2/8] [NET]: uninline skb_pull, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen

-28162  354 funcs, 3005 +, 31167 -, diff: -28162 --- skb_pull

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/skbuff.h |   15 +--
 net/core/skbuff.c  |   16 
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5925435..a9f8f15 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -930,20 +930,7 @@ static inline unsigned char *__skb_pull(struct sk_buff 
*skb, unsigned int len)
return skb-data += len;
 }
 
-/**
- * skb_pull - remove data from the start of a buffer
- * @skb: buffer to use
- * @len: amount of data to remove
- *
- * This function removes data from the start of a buffer, returning
- * the memory to the headroom. A pointer to the next data in the buffer
- * is returned. Once the data has been pulled future pushes will overwrite
- * the old data.
- */
-static inline unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
-{
-   return unlikely(len  skb-len) ? NULL : __skb_pull(skb, len);
-}
+extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
 
 extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 661d439..14f462b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -878,6 +878,22 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int 
len)
 }
 EXPORT_SYMBOL(skb_put);
 
+/**
+ * skb_pull - remove data from the start of a buffer
+ * @skb: buffer to use
+ * @len: amount of data to remove
+ *
+ * This function removes data from the start of a buffer, returning
+ * the memory to the headroom. A pointer to the next data in the buffer
+ * is returned. Once the data has been pulled future pushes will overwrite
+ * the old data.
+ */
+unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
+{
+   return unlikely(len  skb-len) ? NULL : __skb_pull(skb, len);
+}
+EXPORT_SYMBOL(skb_pull);
+
 /* Trims skb to length len. It can change skb pointers.
  */
 
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 6/8] [NET]: uninline skb_trim, de-bloats

2008-02-20 Thread Ilpo Järvinen

-10976  209 funcs, 123 +, 11099 -, diff: -10976 --- skb_trim

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/skbuff.h |   16 +---
 net/core/skbuff.c  |   16 
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c11f248..75d8a66 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1156,21 +1156,7 @@ static inline void __skb_trim(struct sk_buff *skb, 
unsigned int len)
skb_set_tail_pointer(skb, len);
 }
 
-/**
- * skb_trim - remove end from a buffer
- * @skb: buffer to alter
- * @len: new length
- *
- * Cut the length of a buffer down by removing data from the tail. If
- * the buffer is already under the length specified it is not modified.
- * The skb must be linear.
- */
-static inline void skb_trim(struct sk_buff *skb, unsigned int len)
-{
-   if (skb-len  len)
-   __skb_trim(skb, len);
-}
-
+extern void skb_trim(struct sk_buff *skb, unsigned int len);
 
 static inline int __pskb_trim(struct sk_buff *skb, unsigned int len)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 05d43fd..b57cadb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -931,6 +931,22 @@ unsigned char *skb_pull(struct sk_buff *skb, unsigned int 
len)
 }
 EXPORT_SYMBOL(skb_pull);
 
+/**
+ * skb_trim - remove end from a buffer
+ * @skb: buffer to alter
+ * @len: new length
+ *
+ * Cut the length of a buffer down by removing data from the tail. If
+ * the buffer is already under the length specified it is not modified.
+ * The skb must be linear.
+ */
+void skb_trim(struct sk_buff *skb, unsigned int len)
+{
+   if (skb-len  len)
+   __skb_trim(skb, len);
+}
+EXPORT_SYMBOL(skb_trim);
+
 /* Trims skb to length len. It can change skb pointers.
  */
 
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 7/8] [SCTP]: uninline sctp_add_cmd_sf

2008-02-20 Thread Ilpo Järvinen

I added inline to sctp_add_cmd and appropriate comment there to
avoid adding another call into the call chain. This works at least
with gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Alternatively,
__sctp_add_cmd could be introduced to .h.

net/sctp/sm_statefuns.c:
  sctp_sf_cookie_wait_prm_abort  | -125
  sctp_sf_cookie_wait_prm_shutdown   |  -75
  sctp_sf_do_9_1_prm_abort   |  -75
  sctp_sf_shutdown_sent_prm_abort|  -50
  sctp_sf_pdiscard   |  -25
  sctp_stop_t1_and_abort | -100
  sctp_sf_do_9_2_start_shutdown  | -154
  __sctp_sf_do_9_1_abort |  -50
  sctp_send_stale_cookie_err |  -29
  sctp_sf_abort_violation| -181
  sctp_sf_do_9_2_shutdown_ack| -154
  sctp_sf_do_9_2_reshutack   |  -86
  sctp_sf_tabort_8_4_8   |  -28
  sctp_sf_heartbeat  |  -52
  sctp_sf_shut_8_4_5 |  -27
  sctp_eat_data  | -246
  sctp_sf_shutdown_sent_abort|  -58
  sctp_sf_check_restart_addrs|  -50
  sctp_sf_do_unexpected_init | -110
  sctp_sf_sendbeat_8_3   | -107
  sctp_sf_unk_chunk  |  -65
  sctp_sf_do_prm_asoc| -129
  sctp_sf_do_prm_send|  -25
  sctp_sf_do_9_2_prm_shutdown|  -50
  sctp_sf_error_closed   |  -25
  sctp_sf_error_shutdown |  -25
  sctp_sf_shutdown_pending_prm_abort |  -25
  sctp_sf_do_prm_requestheartbeat|  -28
  sctp_sf_do_prm_asconf  |  -75
  sctp_sf_do_6_3_3_rtx   | -104
  sctp_sf_do_6_2_sack|  -25
  sctp_sf_t1_init_timer_expire   | -133
  sctp_sf_t1_cookie_timer_expire | -104
  sctp_sf_t2_timer_expire| -161
  sctp_sf_t4_timer_expire| -175
  sctp_sf_t5_timer_expire|  -75
  sctp_sf_autoclose_timer_expire |  -50
  sctp_sf_do_5_2_4_dupcook   | -579
  sctp_sf_do_4_C | -125
  sctp_sf_shutdown_pending_abort |  -32
  sctp_sf_do_5_1E_ca | -186
  sctp_sf_backbeat_8_3   |  -27
  sctp_sf_cookie_echoed_err  | -300
  sctp_sf_eat_data_6_2   | -146
  sctp_sf_eat_data_fast_4_4  | -125
  sctp_sf_eat_sack_6_2   |  -29
  sctp_sf_operr_notify   |  -25
  sctp_sf_do_9_2_final   | -152
  sctp_sf_do_asconf  |  -64
  sctp_sf_do_asconf_ack  | -284
  sctp_sf_eat_fwd_tsn_fast   | -160
  sctp_sf_eat_auth   |  -86
  sctp_sf_do_5_1B_init   | -110
  sctp_sf_do_5_1C_ack| -204
  sctp_sf_do_9_2_shutdown|  -78
  sctp_sf_do_ecn_cwr |  -24
  sctp_sf_do_ecne|  -32
  sctp_sf_eat_fwd_tsn| -135
  sctp_sf_do_5_1D_ce | -197
  sctp_sf_beat_8_3   |  -28
 60 functions changed, 6184 bytes removed, diff: -6184
net/sctp/sm_sideeffect.c:
  sctp_side_effects | -3873
  sctp_do_sm| +3429
 2 functions changed, 3429 bytes added, 3873 bytes removed, diff: -444
kernel/uninlined.c:
  sctp_add_cmd_sf   |  +35
 1 function changed, 35 bytes added, diff: +35

vmlinux.o:
 63 functions changed, 3464 bytes added, 10057 bytes removed, diff: -6593

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
Cc: Vlad Yasevich [EMAIL PROTECTED]
---
 include/net/sctp/sm.h |8 ++--
 net/sctp/command.c|   12 +++-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index ef9e7ed..6740b11 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -385,13 +385,9 @@ static inline int ADDIP_SERIAL_gte(__u16 s, __u16 t)
return (((s) == (t)) || (((t) - (s))  ADDIP_SERIAL_SIGN_BIT));
 }
 
-
 /* Run sctp_add_cmd() generating a BUG() if there is a failure.  */
-static inline void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb, 
sctp_arg_t obj)
-{
-   if (unlikely(!sctp_add_cmd(seq, verb, obj)))
-   BUG();
-}
+extern void sctp_add_cmd_sf(sctp_cmd_seq_t *seq, sctp_verb_t verb,
+   sctp_arg_t obj);
 
 /* Check VTAG of the packet matches the sender's own tag. */
 static inline int
diff --git a/net/sctp/command.c b/net/sctp/command.c
index bb97733..187da2d 100644
--- a/net/sctp/command.c
+++ b/net/sctp/command.c
@@ -51,8 +51,11 @@ int sctp_init_cmd_seq(sctp_cmd_seq_t *seq)
 
 /* Add a command to a sctp_cmd_seq_t.
  * Return 0 if the command sequence is full.
+ *
+ * Inline here is not a mistake, this way sctp_add_cmd_sf doesn't need extra
+ * calls, size penalty is of insignificant magnitude here
  */
-int sctp_add_cmd(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj)
+inline int sctp_add_cmd(sctp_cmd_seq_t *seq, sctp_verb_t verb, sctp_arg_t obj)
 {
if (seq-next_free_slot = SCTP_MAX_NUM_COMMANDS)
goto fail;
@@ -66,6 +69,13 @@ fail:
return 0;
 }
 
+/* Run

[RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-20 Thread Ilpo Järvinen

vmlinux.o:
 62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869

...+ these to lib/jhash.o:
 jhash_3words: 112
 jhash2: 276
 jhash: 475

select for networking code might need a more fine-grained approach.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 drivers/infiniband/Kconfig |1 +
 drivers/net/Kconfig|1 +
 fs/Kconfig |1 +
 fs/dlm/Kconfig |1 +
 fs/gfs2/Kconfig|1 +
 include/linux/jhash.h  |   99 +
 init/Kconfig   |2 +
 lib/Kconfig|6 ++
 lib/Makefile   |1 +
 lib/jhash.c|  116 
 net/Kconfig|1 +
 11 files changed, 134 insertions(+), 96 deletions(-)
 create mode 100644 lib/jhash.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a5dc78a..421ab71 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -2,6 +2,7 @@ menuconfig INFINIBAND
tristate InfiniBand support
depends on PCI || BROKEN
depends on HAS_IOMEM
+   select JHASH
---help---
  Core support for InfiniBand (IB).  Make sure to also select
  any protocols you wish to use as well as drivers for your
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index f337800..8257648 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2496,6 +2496,7 @@ config CHELSIO_T3
tristate Chelsio Communications T3 10Gb Ethernet support
depends on PCI
select FW_LOADER
+   select JHASH
help
  This driver supports Chelsio T3-based gigabit and 10Gb Ethernet
  adapters.
diff --git a/fs/Kconfig b/fs/Kconfig
index d731282..693fe71 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1667,6 +1667,7 @@ config NFSD
select LOCKD
select SUNRPC
select EXPORTFS
+   select JHASH
select NFSD_V2_ACL if NFSD_V3_ACL
select NFS_ACL_SUPPORT if NFSD_V2_ACL
select NFSD_TCP if NFSD_V4
diff --git a/fs/dlm/Kconfig b/fs/dlm/Kconfig
index 2dbb422..f27a99a 100644
--- a/fs/dlm/Kconfig
+++ b/fs/dlm/Kconfig
@@ -4,6 +4,7 @@ menuconfig DLM
depends on SYSFS  (IPV6 || IPV6=n)
select CONFIGFS_FS
select IP_SCTP
+   select JHASH
help
A general purpose distributed lock manager for kernel or userspace
applications.
diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig
index de8e64c..b9dcabf 100644
--- a/fs/gfs2/Kconfig
+++ b/fs/gfs2/Kconfig
@@ -3,6 +3,7 @@ config GFS2_FS
depends on EXPERIMENTAL
select FS_POSIX_ACL
select CRC32
+   select JHASH
help
  A cluster filesystem.
 
diff --git a/include/linux/jhash.h b/include/linux/jhash.h
index 2a2f99f..14200c6 100644
--- a/include/linux/jhash.h
+++ b/include/linux/jhash.h
@@ -1,25 +1,6 @@
 #ifndef _LINUX_JHASH_H
 #define _LINUX_JHASH_H
 
-/* jhash.h: Jenkins hash support.
- *
- * Copyright (C) 1996 Bob Jenkins ([EMAIL PROTECTED])
- *
- * http://burtleburtle.net/bob/hash/
- *
- * These are the credits from Bob's sources:
- *
- * lookup2.c, by Bob Jenkins, December 1996, Public Domain.
- * hash(), hash2(), hash3, and mix() are externally useful functions.
- * Routines to test the hash are included if SELF_TEST is defined.
- * You can use this free for any purpose.  It has no warranty.
- *
- * Copyright (C) 2003 David S. Miller ([EMAIL PROTECTED])
- *
- * I've modified Bob's hash to be useful in the Linux kernel, and
- * any bugs present are surely my fault.  -DaveM
- */
-
 /* NOTE: Arguments are modified. */
 #define __jhash_mix(a, b, c) \
 { \
@@ -41,77 +22,12 @@
  * of bytes.  No alignment or length assumptions are made about
  * the input key.
  */
-static inline u32 jhash(const void *key, u32 length, u32 initval)
-{
-   u32 a, b, c, len;
-   const u8 *k = key;
-
-   len = length;
-   a = b = JHASH_GOLDEN_RATIO;
-   c = initval;
-
-   while (len = 12) {
-   a += (k[0] +((u32)k[1]8) +((u32)k[2]16) +((u32)k[3]24));
-   b += (k[4] +((u32)k[5]8) +((u32)k[6]16) +((u32)k[7]24));
-   c += (k[8] +((u32)k[9]8) +((u32)k[10]16)+((u32)k[11]24));
-
-   __jhash_mix(a,b,c);
-
-   k += 12;
-   len -= 12;
-   }
-
-   c += length;
-   switch (len) {
-   case 11: c += ((u32)k[10]24);
-   case 10: c += ((u32)k[9]16);
-   case 9 : c += ((u32)k[8]8);
-   case 8 : b += ((u32)k[7]24);
-   case 7 : b += ((u32)k[6]16);
-   case 6 : b += ((u32)k[5]8);
-   case 5 : b += k[4];
-   case 4 : a += ((u32)k[3]24);
-   case 3 : a += ((u32)k[2]16);
-   case 2 : a += ((u32)k[1]8);
-   case 1 : a += k[0];
-   };
-
-   __jhash_mix(a,b,c);
-
-   return c;
-}
+extern u32 jhash(const void *key, u32 length, u32 initval);
 
 /* A special optimized version that handles 1 or more of u32s

[RFC PATCH 5/8] [NET]: uninline dst_release

2008-02-20 Thread Ilpo Järvinen

Codiff stats:
-16420  187 funcs, 103 +, 16523 -, diff: -16420 --- dst_release

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/net/dst.h |   10 +-
 net/core/dst.c|   10 ++
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index e3ac7d0..bf33471 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -158,15 +158,7 @@ struct dst_entry * dst_clone(struct dst_entry * dst)
return dst;
 }
 
-static inline
-void dst_release(struct dst_entry * dst)
-{
-   if (dst) {
-   WARN_ON(atomic_read(dst-__refcnt)  1);
-   smp_mb__before_atomic_dec();
-   atomic_dec(dst-__refcnt);
-   }
-}
+extern void dst_release(struct dst_entry *dst);
 
 /* Children define the path of the packet through the
  * Linux networking.  Thus, destinations are stackable.
diff --git a/net/core/dst.c b/net/core/dst.c
index 7deef48..cc2e724 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -259,6 +259,16 @@ again:
return NULL;
 }
 
+void dst_release(struct dst_entry *dst)
+{
+   if (dst) {
+   WARN_ON(atomic_read(dst-__refcnt)  1);
+   smp_mb__before_atomic_dec();
+   atomic_dec(dst-__refcnt);
+   }
+}
+EXPORT_SYMBOL(dst_release);
+
 /* Dirty hack. We did it in 2.2 (in __dst_free),
  * we have _very_ good reasons not to repeat
  * this mistake in 2.3, but we have no choice
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot

2008-02-20 Thread Ilpo Järvinen

On Wed, 20 Feb 2008, Jan Engelhardt wrote:

 
 On Feb 20 2008 17:27, Patrick McHardy wrote:
  Striking. How can this even happen? A callsite which calls
  
   dev_alloc_skb(n)
  
  is just equivalent to
  
   __dev_alloc_skb(n, GFP_ATOMIC);
  
  which means there's like 4 (or 8 if it's long) bytes more on the
  stack. For a worst case, count in another 8 bytes for push and pop or mov 
  on
  the stack. But that still does not add up to 23 kb.

I think you misunderstood the results, if I uninlined dev_alloc_skb(), it 
_alone_ was uninlined which basically means that __dev_alloc_skb() that is 
inline as well is included inside that uninlined function.

When both were inlined, they add up to everywhere, and uninlining 
dev_alloc_skb alone mitigates that for both(!) of them in every place 
where dev_alloc_skb is being called. Because __dev_alloc_skb call sites 
are few, most benefits show up already with dev_alloc_skb uninlining 
alone. On the other hand, if __dev_alloc_skb is uninlined, the size 
reasoning you used above applies to dev_alloc_skb callsites, and that
is definately less than 23kB.

  __dev_alloc_skb() is also an inline function which performs
  some extra work. Which raises the question - if dev_alloc_skb()
  is uninlined, shouldn't __dev_alloc_skb() be uninline as well?

Of course that could be done as well, however, I wouldn't be too keen to 
deepen callchain by both of them, ie., uninlined dev_alloc_skb would just 
contain few bytes which perform the call to __dev_alloc_skb which has the 
bit larger content due to that extra work. IMHO the best solution would 
duplicate the extra work to both of them on binary level (obviously 
not on the source level), e.g., by adding static inline ___dev_alloc_skb() 
to .h which is inlined to both of the variants. I'm not too sure if inline 
to __dev_alloc_skb() alone is enough in .c file to result in inlining of 
__dev_alloc_skb to dev_alloc_skb (with all gcc versions and relevant 
optimization settings).

 I'd like to see the results when {__dev_alloc_skb is externed
 and dev_alloc_skb remains inlined}.

The results are right under your nose already... ;-)
See from the list of the series introduction:

  http://marc.info/?l=linux-netdevm=120351526210711w=2

IMHO more interesting number (which I currently don't have) is the 
_remaining_ benefits of uninlining __dev_alloc_skb after
dev_alloc_skb was first uninlined.


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 7/8] [SCTP]: uninline sctp_add_cmd_sf

2008-02-20 Thread Ilpo Järvinen

On Wed, 20 Feb 2008, Vlad Yasevich wrote:

 Ilpo Järvinen wrote:
  I added inline to sctp_add_cmd and appropriate comment there to
  avoid adding another call into the call chain. This works at least
  with gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13). Alternatively,
  __sctp_add_cmd could be introduced to .h.
  
 
 My only concern was performance regressions, but it looks like it
 doesn't effect anything from the few quick runs I've made.

There was one call made anyway, it's a bit hard to see how it would hurt 
to push that BUG() deeper down (in fact, this is one of the easiest case 
in this respect, many other cases elsewhere that need uninlining don't 
currently make any calls with inlines).

 Since we are putting sctp_add_cmd_sf() on the call stack, we might
 as well get rid of sctp_add_cmd() and reduce it a bit more.

IMHO it is definately better solution for archiving the size reduction,
I just didn't know before that the only sctp_add_cmd call could be 
converted.

[...snip...]
 diff --git a/net/sctp/command.c b/net/sctp/command.c
 index bb97733..3a06513 100644
 --- a/net/sctp/command.c
 +++ b/net/sctp/command.c
 @@ -51,19 +51,16 @@ int sctp_init_cmd_seq(sctp_cmd_seq_t *seq)
  
  /* Add a command to a sctp_cmd_seq_t.
   * Return 0 if the command sequence is full.
 + *
 + * Inline here is not a mistake, this way sctp_add_cmd_sf doesn't need extra
 + * calls, size penalty is of insignificant magnitude here

This won't be a necessary note anymore. :-)

[...snip...]

-- 
 i.

[PATCH] [XFRM] BEET: Remove extra semicolon after if

2008-02-02 Thread Ilpo Järvinen


Once again, one of this kind tries to creep in.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/xfrm4_mode_beet.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/xfrm4_mode_beet.c b/net/ipv4/xfrm4_mode_beet.c
index e093a7b..b47030b 100644
--- a/net/ipv4/xfrm4_mode_beet.c
+++ b/net/ipv4/xfrm4_mode_beet.c
@@ -102,7 +102,7 @@ static int xfrm4_beet_input(struct xfrm_state *x, struct 
sk_buff *skb)
 
XFRM_MODE_SKB_CB(skb)-protocol = ph-nexthdr;
 
-   if (!pskb_may_pull(skb, phlen));
+   if (!pskb_may_pull(skb, phlen))
goto out;
__skb_pull(skb, phlen);
}
-- 
1.5.2.2

[RFC PATCH]: TCP + SACK + skb processing latency

2008-02-01 Thread Ilpo Järvinen

Hi all,

Here's an attempt to reduce amount of skb cleanup work TCP with TSO has to 
do after SACKs have arrived.

I'm not on very familiar grounds with TSOed skbs so there likely is much I 
just couldn't take into account. I probably should at least check some 
flag somewhere? (=NETIF_F_SG?). Also the current the length handling and 
potential presence of headers worries me a lot.

I'd appreciate if somebody with more knowledge about skb internals could 
take a look at the relevant parts (mainly skb_shift) and confirm that 
doing all this I'm trying to do is allowed :-).


--
 i.


--
[RFC PATCH] [TCP]: Try to restore large SKBs while SACK processing

During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
Then we're in problems when cleanup work for them has to be done
when a large cumulative ACK comes. Try to return to pre-split
state while more and more SACK info gets discovered.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/net/tcp.h|5 +
 net/ipv4/tcp_input.c |  206 +-
 2 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..cdf4468 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1170,6 +1170,11 @@ static inline struct sk_buff 
*tcp_write_queue_next(struct sock *sk, struct sk_bu
return skb-next;
 }
 
+static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct 
sk_buff *skb)
+{
+   return skb-prev;
+}
+
 #define tcp_for_write_queue(skb, sk)   \
for (skb = (sk)-sk_write_queue.next;   \
 (skb != (struct sk_buff *)(sk)-sk_write_queue);  \
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 19c449f..2cb7e86 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1206,6 +1206,8 @@ static int tcp_check_dsack(struct tcp_sock *tp, struct 
sk_buff *ack_skb,
  * aligned portion of it that matches. Therefore we might need to fragment
  * which may fail and creates some hassle (caller must handle error case
  * returns).
+ *
+ * FIXME: this could be merged to shift decision code
  */
 static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
 u32 start_seq, u32 end_seq)
@@ -1322,6 +1324,206 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct 
sock *sk,
return flag;
 }
 
+/* Attempts to shift up to shiftlen worth of bytes from prev to skb.
+ * Returns number bytes shifted.
+ *
+ * TODO: in case the prev runs out of frag space, operation could be
+ * made to return with a partial result (would allow tighter packing).
+ */
+static int skb_shift(struct sk_buff *prev, struct sk_buff *skb,
+unsigned int shiftlen)
+{
+   int i, to, merge;
+   unsigned int todo;
+   struct skb_frag_struct *from, *fragto;
+
+   if (skb_cloned(skb) || skb_cloned(prev))
+   return 0;
+
+   todo = shiftlen;
+   i = 0;
+   from = skb_shinfo(skb)-frags[i];
+   to = skb_shinfo(prev)-nr_frags;
+
+   merge = to - 1;
+   if (!skb_can_coalesce(prev, merge + 1, from-page, from-page_offset))
+   merge = -1;
+   if (merge = 0) {
+   i++;
+   if (from-size = shiftlen)
+   goto onlymerge;
+   todo -= from-size;
+   }
+
+   /* Skip full, not-fitting skb to avoid expensive operations */
+   if ((shiftlen == skb-len) 
+   (skb_shinfo(skb)-nr_frags - merge)  (MAX_SKB_FRAGS - to))
+   return 0;
+
+   while (todo  (i  skb_shinfo(skb)-nr_frags)) {
+   if (to == MAX_SKB_FRAGS)
+   return 0;
+
+   from = skb_shinfo(skb)-frags[i];
+   fragto = skb_shinfo(prev)-frags[to];
+
+   if (todo = from-size) {
+   *fragto = *from;
+   todo -= from-size;
+   i++;
+   to++;
+
+   } else {
+   fragto-page = from-page;
+   get_page(fragto-page);
+   fragto-page_offset = from-page_offset;
+   fragto-size = todo;
+   from-page_offset += todo;
+   from-size -= todo;
+
+   to++;
+   break;
+   }
+   }
+   skb_shinfo(prev)-nr_frags = to;
+
+   /* Delayed, so that we don't have to undo it */
+   if (merge = 0) {
+onlymerge:
+   from = skb_shinfo(skb)-frags[0];
+   skb_shinfo(prev)-frags[merge].size += min(from-size, 
shiftlen);
+   put_page(from-page);
+   }
+
+   /* Reposition in the original skb */
+   to = 0;
+   while (i  skb_shinfo(skb)-nr_frags)
+   skb_shinfo(skb)-frags[to++] = skb_shinfo(skb

Re: [RFC PATCH]: TCP + SACK + skb processing latency

2008-02-01 Thread Ilpo Järvinen

On Fri, 1 Feb 2008, Ilpo Järvinen wrote:

 @@ -1322,6 +1324,206 @@ static int tcp_sacktag_one(struct sk_buff *skb, 
 struct sock *sk,
   return flag;
  }
  
 +/* Attempts to shift up to shiftlen worth of bytes from prev to skb.
 + * Returns number bytes shifted.
 + *
 + * TODO: in case the prev runs out of frag space, operation could be
 + * made to return with a partial result (would allow tighter packing).
 + */
 +static int skb_shift(struct sk_buff *prev, struct sk_buff *skb,
 +  unsigned int shiftlen)
 +{
 + int i, to, merge;
 + unsigned int todo;
 + struct skb_frag_struct *from, *fragto;
 +
 + if (skb_cloned(skb) || skb_cloned(prev))
 + return 0;
 +
 + todo = shiftlen;
 + i = 0;
 + from = skb_shinfo(skb)-frags[i];
 + to = skb_shinfo(prev)-nr_frags;
 +
 + merge = to - 1;
 + if (!skb_can_coalesce(prev, merge + 1, from-page, from-page_offset))
 + merge = -1;
 + if (merge = 0) {
 + i++;
 + if (from-size = shiftlen)
 + goto onlymerge;
 + todo -= from-size;
 + }
 +
 + /* Skip full, not-fitting skb to avoid expensive operations */
 + if ((shiftlen == skb-len) 
 + (skb_shinfo(skb)-nr_frags - merge)  (MAX_SKB_FRAGS - to))

Before somebody else notices it:
  s/merge/i/, a leftover from flag - idx conversion.

 + return 0;
 +
 + while (todo  (i  skb_shinfo(skb)-nr_frags)) {
 + if (to == MAX_SKB_FRAGS)
 + return 0;
 +


-- 
 i.

Re: [2.6 patch] unexport sysctl_tcp_tso_win_divisor

2008-01-30 Thread Ilpo Järvinen

On Wed, 30 Jan 2008, Adrian Bunk wrote:

 This patch removes the no longer used 
 EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor).
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

 ---
 4884e7997ba5f63f2efeaeead21ed2768fb3f4de 
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 89f0188..ed750f9 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -2564,5 +2564,4 @@ EXPORT_SYMBOL(tcp_connect);
  EXPORT_SYMBOL(tcp_make_synack);
  EXPORT_SYMBOL(tcp_simple_retransmit);
  EXPORT_SYMBOL(tcp_sync_mss);
 -EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor);
  EXPORT_SYMBOL(tcp_mtup_init);

...yes, results from the recent move of tcp_is_cwnd_limited() away from 
tcp.h.

Acked-by: Ilpo Järvinen [EMAIL PROTECTED]

-- 
 i.

[PATCH] [TIPC] Kill unused static inline (x5)

2008-01-25 Thread Ilpo Järvinen

From: =?ISO-8859-1?q?Ilpo=20J=E4rvinen?= [EMAIL PROTECTED]

All these static inlines are unused:

in_own_zone 1 (net/tipc/addr.h)
msg_dataoctet   1 (net/tipc/msg.h)
msg_direct  1 (include/net/tipc/tipc_msg.h)
msg_options 1 (include/net/tipc/tipc_msg.h)
tipc_nmap_get   1 (net/tipc/bcast.h)

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/net/tipc/tipc_msg.h |   16 
 net/tipc/addr.h |5 -
 net/tipc/bcast.h|   13 -
 net/tipc/msg.h  |5 -
 4 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/include/net/tipc/tipc_msg.h b/include/net/tipc/tipc_msg.h
index fb42eb7..2e159a8 100644
--- a/include/net/tipc/tipc_msg.h
+++ b/include/net/tipc/tipc_msg.h
@@ -130,11 +130,6 @@ static inline u32 msg_type(struct tipc_msg *m)
return msg_bits(m, 1, 29, 0x7);
 }
 
-static inline u32 msg_direct(struct tipc_msg *m)
-{
-   return (msg_type(m) == TIPC_DIRECT_MSG);
-}
-
 static inline u32 msg_named(struct tipc_msg *m)
 {
return (msg_type(m) == TIPC_NAMED_MSG);
@@ -207,17 +202,6 @@ static inline u32 msg_nameupper(struct tipc_msg *m)
return msg_word(m, 10);
 }
 
-static inline char *msg_options(struct tipc_msg *m, u32 *len)
-{
-   u32 pos = msg_bits(m, 1, 16, 0x7);
-
-   if (!pos)
-   return 0;
-   pos = (pos * 4) + 28;
-   *len = msg_hdr_sz(m) - pos;
-   return (char *)m-hdr[pos/4];
-}
-
 #endif
 
 #endif
diff --git a/net/tipc/addr.h b/net/tipc/addr.h
index e4bd533..3ba67e6 100644
--- a/net/tipc/addr.h
+++ b/net/tipc/addr.h
@@ -57,11 +57,6 @@ static inline int in_own_cluster(u32 addr)
return !((addr ^ tipc_own_addr)  12);
 }
 
-static inline int in_own_zone(u32 addr)
-{
-   return !((addr ^ tipc_own_addr)  24);
-}
-
 static inline int is_slave(u32 addr)
 {
return addr  0x800;
diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h
index f910ed2..a2416fa 100644
--- a/net/tipc/bcast.h
+++ b/net/tipc/bcast.h
@@ -74,19 +74,6 @@ extern char tipc_bclink_name[];
 
 
 /**
- * nmap_get - determine if node exists in a node map
- */
-
-static inline int tipc_nmap_get(struct node_map *nm_ptr, u32 node)
-{
-   int n = tipc_node(node);
-   int w = n / WSIZE;
-   int b = n % WSIZE;
-
-   return nm_ptr-map[w]  (1  b);
-}
-
-/**
  * nmap_add - add a node to a node map
  */
 
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index ce26598..e9ef6df 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -663,11 +663,6 @@ static inline void msg_set_remote_node(struct tipc_msg *m, 
u32 a)
msg_set_word(m, msg_hdr_sz(m)/4, a);
 }
 
-static inline int msg_dataoctet(struct tipc_msg *m, u32 pos)
-{
-   return(msg_data(m)[pos + 4] != 0);
-}
-
 static inline void msg_set_dataoctet(struct tipc_msg *m, u32 pos)
 {
msg_data(m)[pos + 4] = 1;
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-24 Thread Ilpo Järvinen

  [c0155659] ? trace_hardirqs_on+0xb9/0x130
  [c01059ca] common_interrupt+0x2e/0x34
  [c0103390] ? mwait_idle_with_hints+0x40/0x50
  [c01033a0] ? mwait_idle+0x0/0x20
  [c01033b2] mwait_idle+0x12/0x20
  [c0103141] cpu_idle+0x61/0x110
  [c04339fd] rest_init+0x5d/0x60
  [c05a47fa] start_kernel+0x1fa/0x260
  [c05a4190] ? unknown_bootoption+0x0/0x130
  ===
 ---[ end trace 14b601818e6903ac ]---

...But this no longer is, and even more, L: 5 is not valid state at this 
point all (should only happen if we went to RTO but it would reset S to 
zero with newreno):

 P: 5 L: 5 vs 5 S: 0 vs 3 w: 2044790889-2044796616 (0)
 TCP wq(s) l
 TCP wq(h) +++h+
 l5 s3 f0 p5 seq: su2044790889 hs2044795029 sn2044796616

Surprisingly, it was the first time the WARN_ON for left_out returned 
correct location. This also explains why the patch I sent to Krishna 
didn't print anything (it didn't end up into printing because I forgot 
to add L+SP check into to the state checking if).

...so please, could you (others than Denys) try this patch, it should 
solve the issue. And Denys, could you confirm (and if necessary double 
check) that the kernel you saw this similar problem with is the pure 
Linus' mainline, i.e., without any net-2.6.25 or mm bits please, if so, 
that problem persists. And anyway, there were some fackets_out related 
problems reported as well and this doesn't help for that but I think I've 
lost track of who was seeing it due to large number of reports :-), could 
somebody refresh my memory because I currently don't have time to dig it 
up from archives (at least on this week).


-- 
 i.

--
[PATCH] [TCP]: NewReno must count every skb while marking losses

NewReno should add cnt per skb (as with FACK) instead of depending
on SACKED_ACKED bits which won't be set with it at all.
Effectively, NewReno should always exists after the first
iteration anyway (or immediately if there's already head in
lost_out.

This was fixed earlier in net-2.6.25 but got reverted among other
stuff and I didn't notice that this is still necessary (actually
wasn't even considering this case while trying to figure out the
reports because I lived with different kind of code than it in
reality was).

This should solve the WARN_ONs in TCP code that as a result of
this triggered multiple times in every place we check for this
invariant.

Special thanks to Dave Young [EMAIL PROTECTED] and
Krishna Kumar2 [EMAIL PROTECTED] for trying with my debug
patches.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 295490e..aa409a5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2156,7 +2156,7 @@ static void tcp_mark_head_lost(struct sock *sk, int 
packets, int fast_rexmit)
tp-lost_skb_hint = skb;
tp-lost_cnt_hint = cnt;
 
-   if (tcp_is_fack(tp) ||
+   if (tcp_is_fack(tp) || tcp_is_reno(tp) ||
(TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_ACKED))
cnt += tcp_skb_pcount(skb);
 
-- 
1.5.2.2

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-24 Thread Ilpo Järvinen

On Thu, 24 Jan 2008, Ilpo Järvinen wrote:

 And anyway, there were some fackets_out related 
 problems reported as well and this doesn't help for that but I think I've 
 lost track of who was seeing it due to large number of reports :-), could 
 somebody refresh my memory because I currently don't have time to dig it 
 up from archives (at least on this week).

Here's the updated debug patch for net-2.6.25/mm for tracking 
fackets_out inconsistencies (it won't work for trees which don't
include net-2.6.25, mm does of course :-)).

I hope I got it into good shape this time to avoid spurious stacktraces 
but still maintaining 100% accuracy, but it's not a simple oneliner so I 
might have missed something... :-)

I'd suggest that people trying with this first apply the newreno fix of 
the previous mail to avoid already-fixed case from triggering.

-- 
 i.

--
[PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline)

---
 include/net/tcp.h |5 ++-
 net/ipv4/tcp_input.c  |   18 +++-
 net/ipv4/tcp_ipv4.c   |  127 +
 net/ipv4/tcp_output.c |   23 +++--
 4 files changed, 165 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..552aa71 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,9 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)   SNMP_ADD_STATS_BH(tcp_statistics, 
field, val)
 #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, 
field, val)
 
+extern void tcp_print_queue(struct sock *sk);
+extern voidtcp_verify_wq(struct sock *sk);
+
 extern voidtcp_v4_err(struct sk_buff *skb, u32);
 
 extern voidtcp_shutdown (struct sock *sk, int how);
@@ -768,7 +771,7 @@ static inline __u32 tcp_current_ssthresh(const struct sock 
*sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp)  
tp-packets_out)
+#define tcp_verify_left_out(tp)tcp_verify_wq((struct sock *)tp)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 19c449f..c897c93 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1426,8 +1426,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff 
*ack_skb,
int first_sack_index;
 
if (!tp-sacked_out) {
-   if (WARN_ON(tp-fackets_out))
+   if (WARN_ON(tp-fackets_out)) {
+   tcp_verify_left_out(tp);
tp-fackets_out = 0;
+   }
tcp_highest_sack_reset(sk);
}
 
@@ -2136,6 +2138,8 @@ static void tcp_mark_head_lost(struct sock *sk, int 
packets, int fast_rexmit)
struct sk_buff *skb;
int cnt;
 
+   tcp_verify_left_out(tp);
+
BUG_TRAP(packets = tp-packets_out);
if (tp-lost_skb_hint) {
skb = tp-lost_skb_hint;
@@ -2501,6 +2505,8 @@ static void tcp_fastretrans_alert(struct sock *sk, int 
pkts_acked, int flag)
(tcp_fackets_out(tp)  tp-reordering));
int fast_rexmit = 0;
 
+   tcp_verify_left_out(tp);
+
if (WARN_ON(!tp-packets_out  tp-sacked_out))
tp-sacked_out = 0;
if (WARN_ON(!tp-sacked_out  tp-fackets_out))
@@ -2645,6 +2651,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int 
pkts_acked, int flag)
if (do_lost || (tcp_is_fack(tp)  tcp_head_timedout(sk)))
tcp_update_scoreboard(sk, fast_rexmit);
tcp_cwnd_down(sk, flag);
+
+   WARN_ON(tcp_write_queue_head(sk) == NULL);
+   WARN_ON(!tp-packets_out);
+
tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2858,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
prior_fackets)
tcp_clear_all_retrans_hints(tp);
}
 
+   tcp_verify_left_out(tp);
+
if (skb  (TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_ACKED))
flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3187,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, 
int flag)
prior_fackets = tp-fackets_out;
prior_in_flight = tcp_packets_in_flight(tp);
 
+   tcp_verify_left_out(tp);
+
if (!(flag  FLAG_SLOWPATH)  after(ack, prior_snd_una)) {
/* Window is constant, pure forward advance.
 * No more checks are required.
@@ -3237,6 +3251,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, 
int flag)
if ((flag  FLAG_FORWARD_PROGRESS) || !(flag  FLAG_NOT_DUP))
dst_confirm(sk-sk_dst_cache);
 
+   tcp_verify_left_out(tp);
+
return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9aea88b..e6e3ad5 100644
--- a/net/ipv4

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-23 Thread Ilpo Järvinen

On Wed, 23 Jan 2008, Dave Young wrote:

 On Jan 23, 2008 3:41 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote:
 
  On Tue, 22 Jan 2008, David Miller wrote:
 
   From: Dave Young [EMAIL PROTECTED]
   Date: Wed, 23 Jan 2008 09:44:30 +0800
  
On Jan 22, 2008 6:47 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote:
 [PATCH] [TCP]: debug S+L
   
Thanks, If there's new findings I will let you know.
  
   Thanks for helping with this bug Dave.
 
  I noticed btw that there thing might (is likely to) spuriously trigger at
  WARN_ON(sacked != tp-sacked_out); because those won't be equal when SACK
  is not enabled. If that does happen too often, I send a fixed patch for
  it, yet, the fact that I print print tp-rx_opt.sack_ok allows
  identification of those cases already as it's zero when SACK is not
  enabled.
 
  Just ask if you need the updated debug patch.
 
 Thanks,  please send, I would like to get it.

There you go. I fixed non-SACK case by adding tcp_is_sack checks there and 
also added two verifys to tcp_ack to see if there's corruption outside of 
TCP.

-- 
 i.

[PATCH] [TCP]: debug S+L

---
 include/net/tcp.h |8 +++-
 net/ipv4/tcp_input.c  |   10 +
 net/ipv4/tcp_ipv4.c   |  101 +
 net/ipv4/tcp_output.c |   21 +++---
 4 files changed, 133 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..0685035 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)   SNMP_ADD_STATS_BH(tcp_statistics, 
field, val)
 #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, 
field, val)
 
+extern voidtcp_verify_wq(struct sock *sk);
+
 extern voidtcp_v4_err(struct sk_buff *skb, u32);
 
 extern voidtcp_shutdown (struct sock *sk, int how);
@@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock 
*sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp)  
tp-packets_out)
+#define tcp_verify_left_out(tp)\
+   do { \
+   WARN_ON(tcp_left_out(tp)  tp-packets_out); \
+   tcp_verify_wq((struct sock *)tp); \
+   } while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fa2c85c..cdacf70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int 
pkts_acked, int flag)
if (do_lost || (tcp_is_fack(tp)  tcp_head_timedout(sk)))
tcp_update_scoreboard(sk, fast_rexmit);
tcp_cwnd_down(sk, flag);
+
+   WARN_ON(tcp_write_queue_head(sk) == NULL);
+   WARN_ON(!tp-packets_out);
+
tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
prior_fackets)
tcp_clear_all_retrans_hints(tp);
}
 
+   tcp_verify_left_out(tp);
+
if (skb  (TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_ACKED))
flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3181,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, 
int flag)
prior_fackets = tp-fackets_out;
prior_in_flight = tcp_packets_in_flight(tp);
 
+   tcp_verify_left_out(tp);
+
if (!(flag  FLAG_SLOWPATH)  after(ack, prior_snd_una)) {
/* Window is constant, pure forward advance.
 * No more checks are required.
@@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, 
int flag)
if ((flag  FLAG_FORWARD_PROGRESS) || !(flag  FLAG_NOT_DUP))
dst_confirm(sk-sk_dst_cache);
 
+   tcp_verify_left_out(tp);
+
return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9aea88b..7e8ab40 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -108,6 +108,107 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
.lhash_wait  = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait),
 };
 
+void tcp_print_queue(struct sock *sk)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct sk_buff *skb;
+   char s[50+1];
+   char h[50+1];
+   int idx = 0;
+   int i;
+
+   tcp_for_write_queue(skb, sk) {
+   if (skb == tcp_send_head(sk))
+   break;
+
+   for (i = 0; i  tcp_skb_pcount(skb); i++) {
+   if (TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_ACKED) {
+   s[idx] = 'S';
+   if (TCP_SKB_CB(skb)-sacked  TCPCB_LOST)
+   s[idx] = 'B';
+
+   } else if (TCP_SKB_CB(skb)-sacked

Re: Assertions in latest kernels

2008-01-23 Thread Ilpo Järvinen

On Wed, 23 Jan 2008, Krishna Kumar2 wrote:

 David Miller [EMAIL PROTECTED] wrote on 01/23/2008 01:27:23 PM:
 
   iperf with multiple threads almost always gets these 4, *especially*
 when I
   do some batching :).
  
   static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int
 flag)
   {
 ...
 if (WARN_ON(!tp-sacked_out  tp-fackets_out))
   tp-fackets_out = 0;
 ...
   }
 
  Does this assertion show up first or do you get the other TCP
  ones first?  It might be important, in that if you get the
  others ones first that corrupted state might be what leads to
  this one.
 
 Hi Dave,
 
 I looked at my *old* messages file and found this assert (2506) was first
 to hit (atleast in
 two messages file). It hit 5 times, then I got a different one that I had
 not reported earlier:
 
 KERNEL: assertion (packets = tp-packets_out) failed at
 net/ipv4/tcp_input.c (2139)
 
 (though this was hidden in my report under the panic for tcp_input.c:2528.
 
 Then another two thousand times of the 2506 asserts.
 
 Today I installed the latest untouched kernel, rebooted system and got the
 following errors
 in sequence, but no 2506 errors (which I have always got when running
 batching in the last
 2-3 weeks):
 
 Jan 22 02:07:55 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:07:57 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:07:58 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:00 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:02 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:04 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:05 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767
 Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:08 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:09 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:11 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:13 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:14 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767
 Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767
 Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767
 Jan 22 02:08:17 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528
 Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169
 
 and so on for another 700 counts.

 The unique asserts are:
   1767: tcp_verify_left_out (from tcp_entry_frto)
   2169: tcp_verify_left_out (from tcp_mark_head_lost)
   2528: tcp_verify_left_out (from tcp_fastretrans_alert)
   3063: tcp_verify_left_out (from tcp_process_frto)
 (where 2169 seems to preceed any other asserts)

Once you get one of these, you'll get a large number of them, maybe I 
should just change it to WARN_ON_ONCE to stop confusing people with the 
rest.

 The other two asserts that I got only with batching are:
   2139: BUG_TRAP(packets = tp-packets_out); (in tcp_mark_head_lost)
   2506: WARN_ON(!tp-sacked_out  tp-fackets_out) (in
 tcp_fastretrans_alert)
 (where 2506 always seems to preceed any other asserts).

It's almost impossible to know which of these is the main cause and the 
first occuring due to reasons I'll not copy here. What a strange thing
that it has been super quiet on this front until now everybody is seeing 
it, could there be something unrelated to TCP which has broken it all 
recently?

Good thing is that you seem to be able to reproduce it were nicely.
Please try

Re: Assertions in latest kernels

2008-01-23 Thread Ilpo Järvinen

On Wed, 23 Jan 2008, Krishna Kumar2 wrote:

 Hi Ilpo,
 
  It's almost impossible to know which of these is the main cause and the
  first occuring due to reasons I'll not copy here. What a strange thing
  that it has been super quiet on this front until now everybody is seeing
  it, could there be something unrelated to TCP which has broken it all
  recently?
 
 I have been getting this for atleast 3 weeks but I was quiet since those
 were kernels that I had modified.

Since you can easily reproduce it, lets just figure out what's causing it 
hard way, rather than digging the endless git-logs... :-)

  Good thing is that you seem to be able to reproduce it were nicely.
  Please try with this beauty below... Hopefully got it correctly matching
  to mainline, in contrast to version I sent Dave Y., there's some added
  candy which catches highest_sack corruption as well, at least it compiles
 
  already :-).
 
 There were couple of patch apply failures in .c files which I fixed by
 hand.
 But when compiling, I got these errors (I am using DM's 2.6.24-rc7 kernel,
 net-2.6.25.git):

Well, that's annoying, you didn't mention net-2.6.25 back then, it sure 
is incompatible with it like already the patch title said... :-)


 net/ipv4/tcp_output.c: In function 'tcp_push_one':
 net/ipv4/tcp_output.c:1573: error: 'tp' undeclared (first use in this
 function)
 net/ipv4/tcp_output.c:1573: error: (Each undeclared identifier is reported
 only once
 net/ipv4/tcp_output.c:1573: error: for each function it appears in.)

...This is either due to mismerge or due your modifications.

Lets re-iterate, compiled ok for me tcp_{ipv4,input,output}.o...


-- 
 i.

[PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline)

---
 include/net/tcp.h |8 +++-
 net/ipv4/tcp_input.c  |   10 +
 net/ipv4/tcp_ipv4.c   |  107 +
 net/ipv4/tcp_output.c |   21 +++---
 4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..0685035 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)   SNMP_ADD_STATS_BH(tcp_statistics, 
field, val)
 #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, 
field, val)
 
+extern voidtcp_verify_wq(struct sock *sk);
+
 extern voidtcp_v4_err(struct sk_buff *skb, u32);
 
 extern voidtcp_shutdown (struct sock *sk, int how);
@@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock 
*sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp)  
tp-packets_out)
+#define tcp_verify_left_out(tp)\
+   do { \
+   WARN_ON(tcp_left_out(tp)  tp-packets_out); \
+   tcp_verify_wq((struct sock *)tp); \
+   } while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fa2c85c..cdacf70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int 
pkts_acked, int flag)
if (do_lost || (tcp_is_fack(tp)  tcp_head_timedout(sk)))
tcp_update_scoreboard(sk, fast_rexmit);
tcp_cwnd_down(sk, flag);
+
+   WARN_ON(tcp_write_queue_head(sk) == NULL);
+   WARN_ON(!tp-packets_out);
+
tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
prior_fackets)
tcp_clear_all_retrans_hints(tp);
}
 
+   tcp_verify_left_out(tp);
+
if (skb  (TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_ACKED))
flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3181,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, 
int flag)
prior_fackets = tp-fackets_out;
prior_in_flight = tcp_packets_in_flight(tp);
 
+   tcp_verify_left_out(tp);
+
if (!(flag  FLAG_SLOWPATH)  after(ack, prior_snd_una)) {
/* Window is constant, pure forward advance.
 * No more checks are required.
@@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, 
int flag)
if ((flag  FLAG_FORWARD_PROGRESS) || !(flag  FLAG_NOT_DUP))
dst_confirm(sk-sk_dst_cache);
 
+   tcp_verify_left_out(tp);
+
return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9aea88b..c95682e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -108,6 +108,113 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
.lhash_wait  = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait),
 };
 
+void tcp_print_queue(struct sock *sk)

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-23 Thread Ilpo Järvinen

On Wed, 23 Jan 2008, Ilpo Järvinen wrote:

 On Wed, 23 Jan 2008, Dave Young wrote:

  On Jan 23, 2008 3:41 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote:

   On Tue, 22 Jan 2008, David Miller wrote:

From: Dave Young [EMAIL PROTECTED]
Date: Wed, 23 Jan 2008 09:44:30 +0800

 On Jan 22, 2008 6:47 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote:
  [PATCH] [TCP]: debug S+L

 Thanks, If there's new findings I will let you know.

Thanks for helping with this bug Dave.

   I noticed btw that there thing might (is likely to) spuriously trigger at
   WARN_ON(sacked != tp-sacked_out); because those won't be equal when SACK
   is not enabled. If that does happen too often, I send a fixed patch for
   it, yet, the fact that I print print tp-rx_opt.sack_ok allows
   identification of those cases already as it's zero when SACK is not
   enabled.

   Just ask if you need the updated debug patch.

  Thanks,  please send, I would like to get it.

 There you go. I fixed non-SACK case by adding tcp_is_sack checks there and 
 also added two verifys to tcp_ack to see if there's corruption outside of 
 TCP.

There's some discussion about a problem that is very likely the same as in 
here (sorry for not remembering to cc you in there due to rapid progress):

   http://marc.info/?t=12010717423r=1w=2

-- 
 i.

Re: Assertions in latest kernels

2008-01-23 Thread Ilpo Järvinen

On Wed, 23 Jan 2008, Ilpo Järvinen wrote:

 On Wed, 23 Jan 2008, Krishna Kumar2 wrote:
 
  Hi Ilpo,
  
   It's almost impossible to know which of these is the main cause and the
   first occuring due to reasons I'll not copy here. What a strange thing
   that it has been super quiet on this front until now everybody is seeing
   it, could there be something unrelated to TCP which has broken it all
   recently?
  
  I have been getting this for atleast 3 weeks but I was quiet since those
  were kernels that I had modified.
 
 Since you can easily reproduce it, lets just figure out what's causing it 
 hard way, rather than digging the endless git-logs... :-)

Hmm, perhaps it could be something related to this (and some untested 
path somewhere which is now exposed):

commit 4a55b553f691abadaa63570dfc714e20913561c1
Author: Ilpo Järvinen [EMAIL PROTECTED]
Date:   Thu Dec 20 20:36:03 2007 -0800

[TCP]: Fix TSO deferring

Dave, what do you think? Wouldn't explain the one -rc only report though 
from Denys. Another one I'm a bit unsure of is this:

commit 757c32944b80fd95542bd66f06032ab773034d53
Author: Ilpo Järvinen [EMAIL PROTECTED]
Date:   Thu Jan 3 20:39:01 2008 -0800

[TCP]: Perform setting of common control fields in one place

-sacked field is cleared in tcp_retransmit_skb due to a subtle change, 
which might be buggy However, I find it rather unlikely that this 
would explain Kumar's case. Anyway, here's the one that reverts the 
problematic part of it. ...and this is net-2.6.25 as well so it won't 
solve Denys' case either.

-- 
 i.

--
[PATCH] [TCP]: Revert part of [TCP]: Perform setting of common control...

This commit reverts part of 757c32944b80fd95542bd66f06032ab773034d53
([TCP]: Perform setting of common control fields in one place)
because it's not valid to clear -sacked field like that without
additional precautions with counters, mostly lost_out, sacked_out
should not be set due to reneging which kicks in much earlier than
this.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_output.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 648340f..f6cbc1f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1887,10 +1887,12 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb)
(TCP_SKB_CB(skb)-flags  TCPCB_FLAG_FIN) 
tp-snd_una == (TCP_SKB_CB(skb)-end_seq - 1)) {
if (!pskb_trim(skb, 0)) {
-   /* Reuse, even though it does some unnecessary work */
-   tcp_init_nondata_skb(skb, TCP_SKB_CB(skb)-end_seq - 1,
-TCP_SKB_CB(skb)-flags);
+   TCP_SKB_CB(skb)-seq = TCP_SKB_CB(skb)-end_seq - 1;
+   skb_shinfo(skb)-gso_segs = 1;
+   skb_shinfo(skb)-gso_size = 0;
+   skb_shinfo(skb)-gso_type = 0;
skb-ip_summed = CHECKSUM_NONE;
+   skb-csum = 0;
}
}
 
-- 
1.5.2.2

Re: Assertions in latest kernels

2008-01-23 Thread Ilpo Järvinen

On Wed, 23 Jan 2008, Krishna Kumar2 wrote:

 While running with this patch, I got these errors (pasted at the end
 of this mail).

I don't have a clue why it didn't go to the checking func (or it didn't 
print anything) but just had those WARN_ONs... Hopefully this is giving 
somewhat better input (applies on top of the other debug patch).

--
 i.

[PATCH] [TCP]: more debug

---
 include/net/tcp.h|3 ++-
 net/ipv4/tcp_input.c |9 -
 net/ipv4/tcp_ipv4.c  |   19 ++-
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0685035..129c3b1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,7 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)   SNMP_ADD_STATS_BH(tcp_statistics, 
field, val)
 #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, 
field, val)
 
+extern void tcp_print_queue(struct sock *sk);
 extern voidtcp_verify_wq(struct sock *sk);
 
 extern voidtcp_v4_err(struct sk_buff *skb, u32);
@@ -772,8 +773,8 @@ static inline __u32 tcp_current_ssthresh(const struct sock 
*sk)
 /* Use define here intentionally to get WARN_ON location shown at the caller */
 #define tcp_verify_left_out(tp)\
do { \
-   WARN_ON(tcp_left_out(tp)  tp-packets_out); \
tcp_verify_wq((struct sock *)tp); \
+   WARN_ON(tcp_left_out(tp)  tp-packets_out); \
} while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cdacf70..295490e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2133,12 +2133,15 @@ static void tcp_verify_retransmit_hint(struct tcp_sock 
*tp, struct sk_buff *skb)
 static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct sk_buff *skb;
+   struct sk_buff *skb, *prev = NULL;
int cnt;
 
+   tcp_verify_left_out(tp);
+
BUG_TRAP(packets = tp-packets_out);
if (tp-lost_skb_hint) {
skb = tp-lost_skb_hint;
+   prev = skb;
cnt = tp-lost_cnt_hint;
} else {
skb = tcp_write_queue_head(sk);
@@ -2166,6 +2169,10 @@ static void tcp_mark_head_lost(struct sock *sk, int 
packets, int fast_rexmit)
tcp_verify_retransmit_hint(tp, skb);
}
}
+   if (tcp_left_out(tp)  tp-packets_out) {
+   printk(KERN_ERR Prev hint: %p, exit %p\n, prev, skb);
+   tcp_print_queue(sk);
+   }
tcp_verify_left_out(tp);
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c95682e..c2a88c5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -117,6 +117,15 @@ void tcp_print_queue(struct sock *sk)
int idx = 0;
int i;
 
+   i = 0;
+   tcp_for_write_queue(skb, sk) {
+   if (skb == tcp_send_head(sk))
+   printk(KERN_ERR head %u %p\n, i, skb);
+   else
+   printk(KERN_ERR skb %u %p\n, i, skb);
+   i++;
+   }
+
tcp_for_write_queue(skb, sk) {
if (skb == tcp_send_head(sk))
break;
@@ -195,11 +204,6 @@ void tcp_verify_wq(struct sock *sk)
packets += tcp_skb_pcount(skb);
}
 
-   WARN_ON(lost != tp-lost_out);
-   WARN_ON(tcp_is_sack(tp)  (sacked != tp-sacked_out));
-   WARN_ON(packets != tp-packets_out);
-   WARN_ON(fackets != tp-fackets_out);
-
if ((lost != tp-lost_out) ||
(tcp_is_sack(tp)  (sacked != tp-sacked_out)) ||
(packets != tp-packets_out) ||
@@ -213,6 +217,11 @@ void tcp_verify_wq(struct sock *sk)
tp-rx_opt.sack_ok);
tcp_print_queue(sk);
}
+
+   WARN_ON(lost != tp-lost_out);
+   WARN_ON(tcp_is_sack(tp)  (sacked != tp-sacked_out));
+   WARN_ON(packets != tp-packets_out);
+   WARN_ON(fackets != tp-fackets_out);
 }
 
 static int tcp_v4_get_port(struct sock *sk, unsigned short snum)
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-22 Thread Ilpo Järvinen

On Tue, 22 Jan 2008, Dave Young wrote:

 On Jan 22, 2008 12:37 PM, Dave Young [EMAIL PROTECTED] wrote:
 
  On Jan 22, 2008 5:14 AM, Ilpo Järvinen [EMAIL PROTECTED] wrote:
  
   On Mon, 21 Jan 2008, Dave Young wrote:
  
Please see the kernel messages following,(trigged while using some qemu 
session)
BTW, seems there's some e100 error message as well.
   
PCI: Setting latency timer of device :00:1b.0 to 64
e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
ACPI: PCI Interrupt :03:08.0[A] - GSI 20 (level, low) - IRQ 20
modprobe:2331 conflicting cache attribute efaff000-efb0 
uncached-default
e100: :03:08.0: e100_probe: Cannot map device registers, aborting.
ACPI: PCI interrupt for device :03:08.0 disabled
e100: probe of :03:08.0 failed with error -12
eth0:  setting full-duplex.
[ cut here ]
WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150()
Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq 
snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse 
snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth 
rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 
evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc
Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4
 [c0132100] ? printk+0x0/0x20
 [c0131834] warn_on_slowpath+0x54/0x80
 [c03e8df8] ? ip_finish_output+0x128/0x2e0
 [c03e9527] ? ip_output+0xe7/0x100
 [c03e8a88] ? ip_local_out+0x18/0x20
 [c03e991c] ? ip_queue_xmit+0x3dc/0x470
 [c043641e] ? _spin_unlock_irqrestore+0x5e/0x70
 [c0186be1] ? check_pad_bytes+0x61/0x80
 [c03f6031] tcp_mark_head_lost+0x121/0x150
 [c03f60ac] tcp_update_scoreboard+0x4c/0x170
 [c03f6e0a] tcp_fastretrans_alert+0x48a/0x6b0
 [c03f7d93] tcp_ack+0x1b3/0x3a0
 [c03fa14b] tcp_rcv_established+0x3eb/0x710
 [c04015c5] tcp_v4_do_rcv+0xe5/0x100
 [c0401bbb] tcp_v4_rcv+0x5db/0x660
  
   Doh, once more these S+L things..., the rest are symptom of the first
   problem.
 
  What is the S+L thing? Could you explain a bit?

It means that one of the skbs is both SACKed and marked as LOST at the
same time in the counters (might be due to miscount of lost/sacked_out
too, not necessarilily in the -sacked bits). Such state is logically
invalid because it would mean that the sender thinks that the same packet 
both reached the receiver and is lost in the network.

Traditionally TCP has just silently corrected over-estimates
(sacked_out+lost_out  packets_out). I changed this couple of releases ago
because those over-estimates often are due to bugs that should be fixed  
(there have been couple of them but it has been very quite on this front  
long time, months or even half year already; but I might have broken
something with the early Dec changes).

These problem may originate from a bug that occurred a number of ACKs
earlier the WARN_ON triggered, therefore they are a bit tricky to track,
those WARN_ON serve just for alerting purposes and usually do not point
out where the bug actually occurred.

I usually just asked people to include exhaustive verifier which compares
-sacked bitmaps with sacked/lost_out counters and report immediately when
the problem shows up, rather than waiting for the cheaper S+L check we do
in the WARN_ON to trigger. I tried to collect tracking patch from the
previous efforts (hopefully got it right after modifications).

  I'm a bit worried about its
   reproducability if it takes this far to see it...
  
 
 It's trigged again in my pc, just while using firefox.

...Good, then there's some chance to catch it.

-- 
 i.

[PATCH] [TCP]: debug S+L

---
 include/net/tcp.h |8 +++-
 net/ipv4/tcp_input.c  |6 +++
 net/ipv4/tcp_ipv4.c   |  101 +
 net/ipv4/tcp_output.c |   21 +++---
 4 files changed, 129 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..0685035 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)   SNMP_ADD_STATS_BH(tcp_statistics, 
field, val)
 #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, 
field, val)
 
+extern voidtcp_verify_wq(struct sock *sk);
+
 extern voidtcp_v4_err(struct sk_buff *skb, u32);
 
 extern voidtcp_shutdown (struct sock *sk, int how);
@@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock 
*sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp)  
tp-packets_out)
+#define tcp_verify_left_out(tp)\
+   do { \
+   WARN_ON(tcp_left_out(tp)  tp

Re: WARNING, tcp_fastretrans_alert, rc6-git11

2008-01-22 Thread Ilpo Järvinen

On Tue, 22 Jan 2008, Denys Fedoryshchenko wrote:

 Just got on one of proxies, under high load.
 It is a bit old rc, so probably my report not interesting, but since it is 
 production machines, i cannot change too often.
 Kernel is 2.6.24-rc6-git11

It's not at all useless, there hasn't been any TCP changes lately so this 
is still very much valid.

 Some sysctl adjustments done. Please tell me if need more information.

TCP related sysctl setting would be nice to know (I don't care much about 
mem sizes but others tweaks affecting TCP features are nice to know).

...snip...
 [9561199.893090] WARNING: at net/ipv4/tcp_input.c:2391 tcp_fastretrans_alert()

This check includes fixing if there's miscount (couple of releases ago 
we didn't even care to print that but had inaccurate fackets_out as a 
feature, there still may be some left-overs though I've tried to track 
them down).

To find out cause more accurately (this WARN_ON is just for alerting), I'd 
need to add rather exhaustive searching per ACK which unlikely is 
acceptable for you. Luckily, Dave Young reported one problem lately 
(though he was running brand new mm) which could be caused by the same 
problem as this, so it's not that bad even though you couldn't run skb 
state verifying.

Thanks anyway for the report.

-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-22 Thread Ilpo Järvinen

On Tue, 22 Jan 2008, David Miller wrote:

 From: Dave Young [EMAIL PROTECTED]
 Date: Wed, 23 Jan 2008 09:44:30 +0800

  On Jan 22, 2008 6:47 PM, Ilpo Järvinen [EMAIL PROTECTED] wrote:
   [PATCH] [TCP]: debug S+L

  Thanks, If there's new findings I will let you know.

 Thanks for helping with this bug Dave.

I noticed btw that there thing might (is likely to) spuriously trigger at 
WARN_ON(sacked != tp-sacked_out); because those won't be equal when SACK 
is not enabled. If that does happen too often, I send a fixed patch for 
it, yet, the fact that I print print tp-rx_opt.sack_ok allows
identification of those cases already as it's zero when SACK is not 
enabled.

Just ask if you need the updated debug patch.

-- 
 i.

Re: [PATCH 2/4] dsmark: get rid of trivial function

2008-01-21 Thread Ilpo Järvinen

On Mon, 21 Jan 2008, Patrick McHardy wrote:

 Stephen Hemminger wrote:
  Replace loop in dsmark_valid_indices with equivalent bit math.
  
  Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
  
  --- a/net/sched/sch_dsmark.c2008-01-20 13:07:58.0 -0800
  +++ b/net/sched/sch_dsmark.c2008-01-20 13:22:54.0 -0800
  @@ -45,13 +45,8 @@ struct dsmark_qdisc_data {
   
   static inline int dsmark_valid_indices(u16 indices)
   {
  -   while (indices != 1) {
  -   if (indices  1)
  -   return 0;
  -   indices = 1;
  -   }
  -
  -   return 1;
  +   /* Must have only one bit set */
  +   return (indices  (indices - 1)) == 0;

Isn't there some magic under include/linux to do that btw, I suppose 
that if the caller side zero check is pushed down there too, the
is_power_of_2() is 100% match? :-) 

 hweight seems easier to understand, it took me a bit
 to realize that the comment matches the code :)

In addition, the original seems infinite loop with zero indices given
but luckily that was checked at the caller site already...


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings

2008-01-21 Thread Ilpo Järvinen

On Mon, 21 Jan 2008, Dave Young wrote:

 Please see the kernel messages following,(trigged while using some qemu 
 session)
 BTW, seems there's some e100 error message as well.
 
 PCI: Setting latency timer of device :00:1b.0 to 64
 e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI
 e100: Copyright(c) 1999-2006 Intel Corporation
 ACPI: PCI Interrupt :03:08.0[A] - GSI 20 (level, low) - IRQ 20
 modprobe:2331 conflicting cache attribute efaff000-efb0 uncached-default
 e100: :03:08.0: e100_probe: Cannot map device registers, aborting.
 ACPI: PCI interrupt for device :03:08.0 disabled
 e100: probe of :03:08.0 failed with error -12
 eth0:  setting full-duplex.
 [ cut here ]
 WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150()
 Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq 
 snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse snd_hda_intel 
 snd_pcm snd_timer btusb rtc_cmos thermal bluetooth rtc_core serio_raw 
 intel_agp button processor sg snd rtc_lib i2c_i801 evdev agpgart soundcore 
 dcdbas 3c59x pcspkr snd_page_alloc
 Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4
  [c0132100] ? printk+0x0/0x20
  [c0131834] warn_on_slowpath+0x54/0x80
  [c03e8df8] ? ip_finish_output+0x128/0x2e0
  [c03e9527] ? ip_output+0xe7/0x100
  [c03e8a88] ? ip_local_out+0x18/0x20
  [c03e991c] ? ip_queue_xmit+0x3dc/0x470
  [c043641e] ? _spin_unlock_irqrestore+0x5e/0x70
  [c0186be1] ? check_pad_bytes+0x61/0x80
  [c03f6031] tcp_mark_head_lost+0x121/0x150
  [c03f60ac] tcp_update_scoreboard+0x4c/0x170
  [c03f6e0a] tcp_fastretrans_alert+0x48a/0x6b0
  [c03f7d93] tcp_ack+0x1b3/0x3a0
  [c03fa14b] tcp_rcv_established+0x3eb/0x710
  [c04015c5] tcp_v4_do_rcv+0xe5/0x100
  [c0401bbb] tcp_v4_rcv+0x5db/0x660

Doh, once more these S+L things..., the rest are symptom of the first 
problem.

What is strange is that it doesn't show up until now, the last TCP
changes that could have some significance are from early Dec/Nov. Is
there some reason why you haven't seen this before this (e.g., not
tested with similar cfg or so)? I'm a bit worried about its
reproducability if it takes this far to see it...


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs

2008-01-14 Thread Ilpo Järvinen

On Sun, 13 Jan 2008, David Miller wrote:

 From: Ilpo_Järvinen [EMAIL PROTECTED]
 Date: Mon, 14 Jan 2008 09:43:08 +0200 (EET)

  I'd prefer sending them as iso-8859-1 compliant (and I guess you are able 
  to test your fix-to-utf-8 machinery with it as well :-)), as it would also 
  make my mails compatible with other people's git apply tools you're not 
  using (otherwise I'd probably forget to change it occassionally when 
  interacting with others than you).

 For now either way is fine with me.  If the situation changes I'll
 let you know.

Ok, I'll remain in iso-8859-1, it's something that is known to work from 
my end. Thanks anyway for fixing it, wasn't any big deal for me at any 
point of time but people started asking me privately to correct it which 
I of course couldn't... :-)

 I'm surprised git-send-email can't get it purely utf8 correctly.

The problem is that my system is ISO natively, so git-send-email might 
encode a ISO native .patch file's content while sending, which in this 
case was intentionally already utf-8.

It might surprise you but it wasn't a long time ago when git-send-email 
wouldn't care less e.g. about header encoding and I got rejects from 
netdev due to my name which wasn't encoded properly, I've 1.5.0.6 
currently and it seemed still fail to encode Cc addresses it adds from 
signed-offs unless I explicitly ask for it to not do that (I explicitly 
ask for especific, encoded, from header anyway because it was broken at 
some point of time and my sending template is copy-paste originating
from that time). There was some recent fixes in the git's logs regarding 
that encoding, so I intend to check if a later g-s-e is more able and if 
it isn't I'll report it to git folks.

 I wonder if there is some issue with how it gets your name
 string for the commit author etc.

I've had it working well since the encoding header got relatively recently 
added (wasn't available at early dawn of git era), before that it was just 
a mess locally. Funny enough, you were able to magically mangle my emails 
to utf-8'ed commits nicely back then so I got a fixed commit back after 
an RTT :-).

 I wonder if getting it into your global GIT config file in proper
 UTF8 encoding would fix things.

 Put something like this into ~/.gitconfig

 [user]
   name = Ilpo Järvinen
   email = [EMAIL PROTECTED]

I have this. In addition I have this (required to make my local system 
consistent):

[i18n]
commitencoding = ISO-8859-1

The problem was just that the API (or better, ABI) between us 
wasn't properly working :-)). While Herbert was working as the
replacement-Dave in November, correct commit entries were created,
so git has been working fine (I guess he used git applying tools
instead of handmade scripts and they handle email correclt based
on it's encoding).

I tried logOutputEncoding = utf-8 in the last patch sets I sent (now 
could again remove it) but git-send-email problem appeared with it
because the system is ISO natively.

 The GIT maintainer is Finnish which makes this situation even
 more perplexing to me, you might want to discuss it with him :-)

Junio? Never heard that a Finnish name... ;-) Perhaps
git-send-email wasn't written by that Finnish guy... :-) 
...Besides, that Finnish git aintainer doesn't have any funny
characters in his name... ;-)

Thanks anyway for the tips  all, I think we have it now working
and I can return to inlines and rexmit_skb_hint things  other TCP
stuff rather than this hinderance. I've some interesting results
from net header inlines checks I ran overnight :-).

-- 
 i.

Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-14 Thread Ilpo Järvinen

On Fri, 11 Jan 2008, Zhang, Yanmin wrote:

 On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: 
  The regression is:
  1)stoakley with 2 qual-core processors: 11%;
  2)Tulsa with 4 dual-core(+hyperThread) processors:13%;
 I have new update on this issue and also cc to netdev maillist.
 Thank David Miller for pointing me the netdev maillist.
 
  
  The test command is:
  #sudo taskset -c 7 ./netserver
  #sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 
  -- -r 1,1
  
  As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's
  regression is between 16%~11%.
  
  I tried to use bisect to locate the bad patch between 2.6.22 and 2.6.23-rc1,
  but the bisected kernel wasn't stable and went crazy.

TCP work between that is very much non-existing.

Using git-reset's to select a nearby merge point instead of default 
commit where bisection lands might be help in case the bisected kernel 
breaks.

Also, limiting bisection under a subsystem might reduce probability of 
brokeness (might at least be able to narrow it down quite a lot), e.g.

git bisect start net/


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-14 Thread Ilpo Järvinen

On Mon, 14 Jan 2008, Ilpo Järvinen wrote:

 On Fri, 11 Jan 2008, Zhang, Yanmin wrote:
 
  On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: 
   
   As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's
   regression is between 16%~11%.
   
   I tried to use bisect to locate the bad patch between 2.6.22 and 
   2.6.23-rc1,
   but the bisected kernel wasn't stable and went crazy.
 
 TCP work between that is very much non-existing.

I _really_ meant 2.6.22 - 2.6.23-rc1, not 2.6.24-rc1 in case you had a 
typo there which is not that uncommon while typing kernel versions... :-)

-- 
 i.

Re: [PATCH 1/8] [TCP]: Uninline tcp_set_state

2008-01-13 Thread Ilpo Järvinen

On Sat, 12 Jan 2008, Stephen Hemminger wrote:

 On Sat, 12 Jan 2008 11:40:10 +0200
 Ilpo Järvinen [EMAIL PROTECTED] wrote:

...snip...

  built-in.o:
   12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445

...snip...

   include/net/tcp.h |   35 +--
   net/ipv4/tcp.c|   35 +++
   2 files changed, 36 insertions(+), 34 deletions(-)
  
  diff --git a/include/net/tcp.h b/include/net/tcp.h
  index 48081ad..306580c 100644
  --- a/include/net/tcp.h
  +++ b/include/net/tcp.h
  @@ -926,40 +926,7 @@ static const char *statename[]={
  Close Wait,Last ACK,Listen,Closing
   };
   #endif
  -
  -static inline void tcp_set_state(struct sock *sk, int state)
  -{
  -   int oldstate = sk-sk_state;
  -
  -   switch (state) {
  -   case TCP_ESTABLISHED:
  -   if (oldstate != TCP_ESTABLISHED)
  -   TCP_INC_STATS(TCP_MIB_CURRESTAB);
  -   break;
  -
  -   case TCP_CLOSE:
  -   if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
  -   TCP_INC_STATS(TCP_MIB_ESTABRESETS);
  -
  -   sk-sk_prot-unhash(sk);
  -   if (inet_csk(sk)-icsk_bind_hash 
  -   !(sk-sk_userlocks  SOCK_BINDPORT_LOCK))
  -   inet_put_port(tcp_hashinfo, sk);
  -   /* fall through */
  -   default:
  -   if (oldstate==TCP_ESTABLISHED)
  -   TCP_DEC_STATS(TCP_MIB_CURRESTAB);
  -   }
  -
  -   /* Change state AFTER socket is unhashed to avoid closed
  -* socket sitting in hash tables.
  -*/
  -   sk-sk_state = state;
  -
  -#ifdef STATE_TRACE
  -   SOCK_DEBUG(sk, TCP sk=%p, State %s - %s\n,sk, 
  statename[oldstate],statename[state]);
  -#endif 
  -}
 
 
 
 Since the function is called with a constant state, I guess the assumption
 was that gcc would be smart enough to only include the code needed, it looks 
 like
 either code was bigger or the compiler was dumber than expected

I'd guess that compiler was just dumber... :-) It might be an interesting 
experiment to convert it to if's and see if it would make a difference, 
maybe it just gets confused by the switch or something.

Besides, it not always that obvious that gcc is able to determine the
constant state, considering e.g., the complexity in the cases with
tcp_rcv_synsent_state_process, tcp_close_state, tcp_done. In such cases
uninlining should be done and gcc is probably not able to mix both cases
nicely for a single function?

However, after looking a bit, I'm partially leaning towards the other 
option too:

tcp_done| -145
tcp_disconnect  | -141

...These called for tcp_set_state just _once_, while this calls for it 
twice:

tcp_fin   |  -86

...Obviously the compiler was able to perform some reductions but 43 bytes 
per inlining seems still a bit high number.


-- 
 i.

Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs

2008-01-13 Thread Ilpo Järvinen

On Sat, 12 Jan 2008, David Miller wrote:

 From: Ilpo_Järvinen [EMAIL PROTECTED]
 Date: Sat, 12 Jan 2008 14:59:50 +0200 (EET)

  ...Maybe I just fall-back to changing my last name, it's the only 
  full-proof solution... ;-)

 Don't do this!  Otherwise I won't have a frequent test case to make
 sure my patch applying scripts are working properly. :-

So which test case you prefer? :-) Is iso-8859-1 from+content ok? Or 
should I keep trying to live with mixed utf-8 which I didn't got even 
fully working last time because git-send-email is probably either too dumb 
or too intelligent (I'm not even sure which), but you were able correct it 
by your tools so the flawed signed-off never entered to the git logs as 
incorrectly formatted :-).

I'd prefer sending them as iso-8859-1 compliant (and I guess you are able 
to test your fix-to-utf-8 machinery with it as well :-)), as it would also 
make my mails compatible with other people's git apply tools you're not 
using (otherwise I'd probably forget to change it occassionally when 
interacting with others than you).

-- 
 i.

[PATCH net-2.6.25 0/8] [NET]: More uninlining related

2008-01-12 Thread Ilpo Järvinen

Hi Dave,

First of all, I changed output encoding of git to utf-8, so I
guess the encoding should not cause the same trouble for you.

Here are couple of more to uninline things. Pretty
straightforward except the EXPORT_SYMBOLs, I've no idea which
is the right variant (feel free to fix them while applying :-)).
Also pktgen uninlining is somewhat questionable because it's
just a testing tool so feel free to drop it if it feels
unnecessary (I could have asked first but it's just as easy to
do it this way if not easier)...

There were more dead static inlines found after inlines removed
(gcc didn't report them otherwise) than in pktgen now included,
but I'm not sure if I should send by default patches removing
or #if 0'ing them?


--
 i.



--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-2.6.25 0/8] [NET]: More uninlining related

2008-01-12 Thread Ilpo Järvinen

Hi Dave,

First of all, I changed output encoding to utf-8, so I guess
the encoding should not cause trouble for you.

Here are couple of more to uninline things. Pretty
straightforward except the EXPORT_SYMBOLs, I've no idea which
is the right variant (feel free to fix them while applying :-)).
Also pktgen uninlining is somewhat questionable because it's
just a testing tool so feel free to drop it if it feels
unnecessary (I could have asked first but it's just as easy to
do it this way if not easier)...

There were more dead static inlines found after inlines removed
(gcc didn't report them otherwise) than in pktgen now included,
but I'm not sure if I should send by default patches removing
or #if 0'ing them?


--
 i.

ps. I apologize that I must resend to get them to netdev as well
because git-send-email of this system (not sure if later could)
still seems to be lacking proper encoding of my name when it
decides to add it to Cc list all by itself and those 8-bit chars
in address got rejected.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/8] [TCP]: Uninline tcp_is_cwnd_limited

2008-01-12 Thread Ilpo Järvinen

net/ipv4/tcp_cong.c:
  tcp_reno_cong_avoid |  -65
 1 function changed, 65 bytes removed, diff: -65

net/ipv4/arp.c:
  arp_ignore |   -5
 1 function changed, 5 bytes removed, diff: -5

net/ipv4/tcp_bic.c:
  bictcp_cong_avoid |  -57
 1 function changed, 57 bytes removed, diff: -57

net/ipv4/tcp_cubic.c:
  bictcp_cong_avoid |  -61
 1 function changed, 61 bytes removed, diff: -61

net/ipv4/tcp_highspeed.c:
  hstcp_cong_avoid |  -63
 1 function changed, 63 bytes removed, diff: -63

net/ipv4/tcp_hybla.c:
  hybla_cong_avoid |  -85
 1 function changed, 85 bytes removed, diff: -85

net/ipv4/tcp_htcp.c:
  htcp_cong_avoid |  -57
 1 function changed, 57 bytes removed, diff: -57

net/ipv4/tcp_veno.c:
  tcp_veno_cong_avoid |  -52
 1 function changed, 52 bytes removed, diff: -52

net/ipv4/tcp_scalable.c:
  tcp_scalable_cong_avoid |  -61
 1 function changed, 61 bytes removed, diff: -61

net/ipv4/tcp_yeah.c:
  tcp_yeah_cong_avoid |  -75
 1 function changed, 75 bytes removed, diff: -75

net/ipv4/tcp_illinois.c:
  tcp_illinois_cong_avoid |  -54
 1 function changed, 54 bytes removed, diff: -54

net/dccp/ccids/ccid3.c:
  ccid3_update_send_interval |   -7
  ccid3_hc_tx_packet_recv|   +7
 2 functions changed, 7 bytes added, 7 bytes removed, diff: +0

net/ipv4/tcp_cong.c:
  tcp_is_cwnd_limited |  +88
 1 function changed, 88 bytes added, diff: +88

built-in.o:
 14 functions changed, 95 bytes added, 642 bytes removed, diff: -547

...Again some gcc artifacts visible as well.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/net/tcp.h   |   22 +-
 net/ipv4/tcp_cong.c |   21 +
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 306580c..7de4ea3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -786,27 +786,7 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 {
return tp-snd_una + tp-snd_wnd;
 }
-
-/* RFC2861 Check whether we are limited by application or congestion window
- * This is the inverse of cwnd check in tcp_tso_should_defer
- */
-static inline int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
-{
-   const struct tcp_sock *tp = tcp_sk(sk);
-   u32 left;
-
-   if (in_flight = tp-snd_cwnd)
-   return 1;
-
-   if (!sk_can_gso(sk))
-   return 0;
-
-   left = tp-snd_cwnd - in_flight;
-   if (sysctl_tcp_tso_win_divisor)
-   return left * sysctl_tcp_tso_win_divisor  tp-snd_cwnd;
-   else
-   return left = tcp_max_burst(tp);
-}
+extern int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
 static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
   const struct sk_buff *skb)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 4451750..3a6be23 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -274,6 +274,27 @@ int tcp_set_congestion_control(struct sock *sk, const char 
*name)
return err;
 }
 
+/* RFC2861 Check whether we are limited by application or congestion window
+ * This is the inverse of cwnd check in tcp_tso_should_defer
+ */
+int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+   u32 left;
+
+   if (in_flight = tp-snd_cwnd)
+   return 1;
+
+   if (!sk_can_gso(sk))
+   return 0;
+
+   left = tp-snd_cwnd - in_flight;
+   if (sysctl_tcp_tso_win_divisor)
+   return left * sysctl_tcp_tso_win_divisor  tp-snd_cwnd;
+   else
+   return left = tcp_max_burst(tp);
+}
+EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
 
 /*
  * Slow start is used when congestion window is less than slow start
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH 8/8] [PKTGEN]: uninline getCurUs

2008-01-12 Thread Ilpo Järvinen

net/core/pktgen.c:
  pktgen_stop_device   |  -50
  pktgen_run   | -105
  pktgen_if_show   |  -37
  pktgen_thread_worker | -702
 4 functions changed, 894 bytes removed, diff: -894

net/core/pktgen.c:
  getCurUs |  +36
 1 function changed, 36 bytes added, diff: +36

net/core/pktgen.o:
 5 functions changed, 36 bytes added, 894 bytes removed, diff: -858

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/core/pktgen.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index ebfb126..d18fdb1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -405,7 +405,7 @@ static inline __u64 tv_to_us(const struct timeval *tv)
return us;
 }
 
-static inline __u64 getCurUs(void)
+static __u64 getCurUs(void)
 {
struct timeval tv;
do_gettimeofday(tv);
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/8] [TCP]: Uninline tcp_set_state

2008-01-12 Thread Ilpo Järvinen

net/ipv4/tcp.c:
  tcp_close_state | -226
  tcp_done| -145
  tcp_close   | -564
  tcp_disconnect  | -141
 4 functions changed, 1076 bytes removed, diff: -1076

net/ipv4/tcp_input.c:
  tcp_fin   |  -86
  tcp_rcv_state_process | -164
 2 functions changed, 250 bytes removed, diff: -250

net/ipv4/tcp_ipv4.c:
  tcp_v4_connect | -209
 1 function changed, 209 bytes removed, diff: -209

net/ipv4/arp.c:
  arp_ignore |   +5
 1 function changed, 5 bytes added, diff: +5

net/ipv6/tcp_ipv6.c:
  tcp_v6_connect | -158
 1 function changed, 158 bytes removed, diff: -158

net/sunrpc/xprtsock.c:
  xs_sendpages |   -2
 1 function changed, 2 bytes removed, diff: -2

net/dccp/ccids/ccid3.c:
  ccid3_update_send_interval |   +7
 1 function changed, 7 bytes added, diff: +7

net/ipv4/tcp.c:
  tcp_set_state | +238
 1 function changed, 238 bytes added, diff: +238

built-in.o:
 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445

I've no explanation why some unrelated changes seem to occur
consistently as well (arp_ignore, ccid3_update_send_interval;
I checked the arp_ignore asm and it seems to be due to some
reordered of operation order causing some extra opcodes to be
generated). Still, the benefits are pretty obvious from the
codiff's results.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]
---
 include/net/tcp.h |   35 +--
 net/ipv4/tcp.c|   35 +++
 2 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 48081ad..306580c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -926,40 +926,7 @@ static const char *statename[]={
Close Wait,Last ACK,Listen,Closing
 };
 #endif
-
-static inline void tcp_set_state(struct sock *sk, int state)
-{
-   int oldstate = sk-sk_state;
-
-   switch (state) {
-   case TCP_ESTABLISHED:
-   if (oldstate != TCP_ESTABLISHED)
-   TCP_INC_STATS(TCP_MIB_CURRESTAB);
-   break;
-
-   case TCP_CLOSE:
-   if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
-   TCP_INC_STATS(TCP_MIB_ESTABRESETS);
-
-   sk-sk_prot-unhash(sk);
-   if (inet_csk(sk)-icsk_bind_hash 
-   !(sk-sk_userlocks  SOCK_BINDPORT_LOCK))
-   inet_put_port(tcp_hashinfo, sk);
-   /* fall through */
-   default:
-   if (oldstate==TCP_ESTABLISHED)
-   TCP_DEC_STATS(TCP_MIB_CURRESTAB);
-   }
-
-   /* Change state AFTER socket is unhashed to avoid closed
-* socket sitting in hash tables.
-*/
-   sk-sk_state = state;
-
-#ifdef STATE_TRACE
-   SOCK_DEBUG(sk, TCP sk=%p, State %s - %s\n,sk, 
statename[oldstate],statename[state]);
-#endif 
-}
+extern void tcp_set_state(struct sock *sk, int state);
 
 extern void tcp_done(struct sock *sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34085e3..7d7b958 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1652,6 +1652,41 @@ recv_urg:
goto out;
 }
 
+void tcp_set_state(struct sock *sk, int state)
+{
+   int oldstate = sk-sk_state;
+
+   switch (state) {
+   case TCP_ESTABLISHED:
+   if (oldstate != TCP_ESTABLISHED)
+   TCP_INC_STATS(TCP_MIB_CURRESTAB);
+   break;
+
+   case TCP_CLOSE:
+   if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
+   TCP_INC_STATS(TCP_MIB_ESTABRESETS);
+
+   sk-sk_prot-unhash(sk);
+   if (inet_csk(sk)-icsk_bind_hash 
+   !(sk-sk_userlocks  SOCK_BINDPORT_LOCK))
+   inet_put_port(tcp_hashinfo, sk);
+   /* fall through */
+   default:
+   if (oldstate==TCP_ESTABLISHED)
+   TCP_DEC_STATS(TCP_MIB_CURRESTAB);
+   }
+
+   /* Change state AFTER socket is unhashed to avoid closed
+* socket sitting in hash tables.
+*/
+   sk-sk_state = state;
+
+#ifdef STATE_TRACE
+   SOCK_DEBUG(sk, TCP sk=%p, State %s - %s\n,sk, 
statename[oldstate],statename[state]);
+#endif 
+}
+EXPORT_SYMBOL_GPL(tcp_set_state);
+
 /*
  * State processing on a close. This implements the state shift for
  * sending our FIN frame. Note that we only send a FIN for some
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/8] [IPV6] route: kill some bloat

2008-01-12 Thread Ilpo Järvinen

net/ipv6/route.c:
  ip6_pkt_prohibit_out | -130
  ip6_pkt_discard  | -261
  ip6_pkt_discard_out  | -130
  ip6_pkt_prohibit | -261
 4 functions changed, 782 bytes removed, diff: -782

net/ipv6/route.c:
  ip6_pkt_drop | +300
 1 function changed, 300 bytes added, diff: +300

net/ipv6/route.o:
 5 functions changed, 300 bytes added, 782 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv6/route.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 02354a7..d3b5811 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1762,8 +1762,7 @@ int ipv6_route_ioctl(unsigned int cmd, void __user *arg)
  * Drop the packet on the floor
  */
 
-static inline int ip6_pkt_drop(struct sk_buff *skb, int code,
-  int ipstats_mib_noroutes)
+static int ip6_pkt_drop(struct sk_buff *skb, int code, int 
ipstats_mib_noroutes)
 {
int type;
switch (ipstats_mib_noroutes) {
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/8] [NETLINK] af_netlink: kill some bloat

2008-01-12 Thread Ilpo Järvinen

net/netlink/af_netlink.c:
  netlink_realloc_groups|  -46
  netlink_insert|  -49
  netlink_autobind  |  -94
  netlink_clear_multicast_users |  -48
  netlink_bind  |  -55
  netlink_setsockopt|  -54
  netlink_release   |  -86
  netlink_kernel_create |  -47
  netlink_change_ngroups|  -56
 9 functions changed, 535 bytes removed, diff: -535

net/netlink/af_netlink.c:
  netlink_table_ungrab |  +53
 1 function changed, 53 bytes added, diff: +53

net/netlink/af_netlink.o:
 10 functions changed, 53 bytes added, 535 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/netlink/af_netlink.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index be07f1b..21f9e30 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -193,7 +193,7 @@ static void netlink_table_grab(void)
}
 }
 
-static inline void netlink_table_ungrab(void)
+static void netlink_table_ungrab(void)
__releases(nl_table_lock)
 {
write_unlock_irq(nl_table_lock);
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 7/8] [PKTGEN]: Kill dead static inlines

2008-01-12 Thread Ilpo Järvinen

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/core/pktgen.c |   94 -
 1 files changed, 0 insertions(+), 94 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index ede1fea..ebfb126 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -397,62 +397,6 @@ struct pktgen_thread {
 #define REMOVE 1
 #define FIND   0
 
-/*  This code works around the fact that do_div cannot handle two 64-bit
-numbers, and regular 64-bit division doesn't work on x86 kernels.
---Ben
-*/
-
-#define PG_DIV 0
-
-/* This was emailed to LMKL by: Chris Caputo [EMAIL PROTECTED]
- * Function copied/adapted/optimized from:
- *
- *  nemesis.sourceforge.net/browse/lib/static/intmath/ix86/intmath.c.html
- *
- * Copyright 1994, University of Cambridge Computer Laboratory
- * All Rights Reserved.
- *
- */
-static inline s64 divremdi3(s64 x, s64 y, int type)
-{
-   u64 a = (x  0) ? -x : x;
-   u64 b = (y  0) ? -y : y;
-   u64 res = 0, d = 1;
-
-   if (b  0) {
-   while (b  a) {
-   b = 1;
-   d = 1;
-   }
-   }
-
-   do {
-   if (a = b) {
-   a -= b;
-   res += d;
-   }
-   b = 1;
-   d = 1;
-   }
-   while (d);
-
-   if (PG_DIV == type) {
-   return (((x ^ y)  (1ll  63)) == 0) ? res : -(s64) res;
-   } else {
-   return ((x  (1ll  63)) == 0) ? a : -(s64) a;
-   }
-}
-
-/* End of hacks to deal with 64-bit math on x86 */
-
-/** Convert to milliseconds */
-static inline __u64 tv_to_ms(const struct timeval *tv)
-{
-   __u64 ms = tv-tv_usec / 1000;
-   ms += (__u64) tv-tv_sec * (__u64) 1000;
-   return ms;
-}
-
 /** Convert to micro-seconds */
 static inline __u64 tv_to_us(const struct timeval *tv)
 {
@@ -461,39 +405,6 @@ static inline __u64 tv_to_us(const struct timeval *tv)
return us;
 }
 
-static inline __u64 pg_div(__u64 n, __u32 base)
-{
-   __u64 tmp = n;
-   do_div(tmp, base);
-   /* printk(pktgen: pg_div, n: %llu  base: %d  rv: %llu\n,
-  n, base, tmp); */
-   return tmp;
-}
-
-static inline __u64 pg_div64(__u64 n, __u64 base)
-{
-   __u64 tmp = n;
-/*
- * How do we know if the architecture we are running on
- * supports division with 64 bit base?
- *
- */
-#if defined(__sparc_v9__) || defined(__powerpc64__) || defined(__alpha__) || 
defined(__x86_64__) || defined(__ia64__)
-
-   do_div(tmp, base);
-#else
-   tmp = divremdi3(n, base, PG_DIV);
-#endif
-   return tmp;
-}
-
-static inline __u64 getCurMs(void)
-{
-   struct timeval tv;
-   do_gettimeofday(tv);
-   return tv_to_ms(tv);
-}
-
 static inline __u64 getCurUs(void)
 {
struct timeval tv;
@@ -501,11 +412,6 @@ static inline __u64 getCurUs(void)
return tv_to_us(tv);
 }
 
-static inline __u64 tv_diff(const struct timeval *a, const struct timeval *b)
-{
-   return tv_to_us(a) - tv_to_us(b);
-}
-
 /* old include end */
 
 static char version[] __initdata = VERSION;
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 6/8] [NETFILTER] xt_policy.c: kill some bloat

2008-01-12 Thread Ilpo Järvinen

net/netfilter/xt_policy.c:
  policy_mt | -906
 1 function changed, 906 bytes removed, diff: -906

net/netfilter/xt_policy.c:
  match_xfrm_state | +427
 1 function changed, 427 bytes added, diff: +427

net/netfilter/xt_policy.o:
 2 functions changed, 427 bytes added, 906 bytes removed, diff: -479

Alternatively, this could be done by combining identical
parts of the match_policy_in/out()

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/netfilter/xt_policy.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c
index 46ee7e8..45731ca 100644
--- a/net/netfilter/xt_policy.c
+++ b/net/netfilter/xt_policy.c
@@ -33,7 +33,7 @@ xt_addr_cmp(const union xt_policy_addr *a1, const union 
xt_policy_addr *m,
return false;
 }
 
-static inline bool
+static bool
 match_xfrm_state(const struct xfrm_state *x, const struct xt_policy_elem *e,
 unsigned short family)
 {
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/8] [XFRM] xfrm_policy: kill some bloat

2008-01-12 Thread Ilpo Järvinen

net/xfrm/xfrm_policy.c:
  xfrm_audit_policy_delete | -692
  xfrm_audit_policy_add| -692
 2 functions changed, 1384 bytes removed, diff: -1384

net/xfrm/xfrm_policy.c:
  xfrm_audit_common_policyinfo | +704
 1 function changed, 704 bytes added, diff: +704

net/xfrm/xfrm_policy.o:
 3 functions changed, 704 bytes added, 1384 bytes removed, diff: -680

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/xfrm/xfrm_policy.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 534b29e..47219f9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2367,8 +2367,8 @@ void __init xfrm_init(void)
 }
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
-   struct audit_buffer *audit_buf)
+static void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
+struct audit_buffer *audit_buf)
 {
struct xfrm_sec_ctx *ctx = xp-security;
struct xfrm_selector *sel = xp-selector;
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] [HTB]: htb_classid is dead static inline

2008-01-12 Thread Ilpo Järvinen

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/sched/sch_htb.c |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 72beb66..6a2352c 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -214,10 +214,6 @@ static inline struct htb_class *htb_find(u32 handle, 
struct Qdisc *sch)
  * then finish and return direct queue.
  */
 #define HTB_DIRECT (struct htb_class*)-1
-static inline u32 htb_classid(struct htb_class *cl)
-{
-   return (cl  cl != HTB_DIRECT) ? cl-classid : TC_H_UNSPEC;
-}
 
 static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
  int *qerr)
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] [NET] core/utils.c: digit2bin is dead static inline

2008-01-12 Thread Ilpo Järvinen

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/core/utils.c |   11 ---
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/net/core/utils.c b/net/core/utils.c
index 34459c4..8031eb5 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -91,17 +91,6 @@ EXPORT_SYMBOL(in_aton);
 #define IN6PTON_NULL   0x2000  /* first/tail */
 #define IN6PTON_UNKNOWN0x4000
 
-static inline int digit2bin(char c, int delim)
-{
-   if (c == delim || c == '\0')
-   return IN6PTON_DELIM;
-   if (c == '.')
-   return IN6PTON_DOT;
-   if (c = '0'  c = '9')
-   return (IN6PTON_DIGIT | (c - '0'));
-   return IN6PTON_UNKNOWN;
-}
-
 static inline int xdigit2bin(char c, int delim)
 {
if (c == delim || c == '\0')
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs

2008-01-12 Thread Ilpo Järvinen

On Sat, 12 Jan 2008, Herbert Xu wrote:

 On Sat, Jan 12, 2008 at 09:40:17AM +, Ilpo Järvinen wrote:
 
 Your emails are now using UTF-8 encoding but it's still declaring
 ISO-8859-1 as the charset. 

Thanks for trying to help but my situation is such that I think it got 
also you confused (this kind of mixed encoding is beoynd my skills 
really)... :-) Besides, I wouldn't mind of having incorrect characters
in my name, I'm just used to that but somebody else wasn't that happy 
about it.

Here's one example...

From: =?utf-8?q?Ilpo_J=C3=A4rvinen?= [EMAIL PROTECTED]
...
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Something still needed besides these to declare email utf-8?

 So you probably want to fix that up or your name may show up as JÃ¤vinen 
 on the reader's screen.

Did you actually see this? I'd expect to see that as well but no, From is 
correctly decoded by my ISO-8859-1'ish MUA, aha, seems that I still have 
something to do to deal with the Signed-off line.

...Maybe I just fall-back to changing my last name, it's the only 
full-proof solution... ;-)

-- 
 i.

TCP hints

2008-01-11 Thread Ilpo Järvinen

Hi Stephen,

Do you still remember what this is for (got added along with other TCP 
hint stuff)? What kind of problem you saw back then (or who saw
problems)?

@@ -1605,6 +1711,10 @@ static void tcp_undo_cwr(struct sock *sk, const int 
undo)
}
tcp_moderate_cwnd(tp);
tp-snd_cwnd_stamp = tcp_time_stamp;
+
+   /* There is something screwy going on with the retrans hints after
+  an undo */
+   clear_all_retrans_hints(tp);
 }
 
 static inline int tcp_may_undo(struct tcp_sock *tp)


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-10 Thread Ilpo Järvinen

On Tue, 8 Jan 2008, Ilpo Järvinen wrote:

 On Mon, 7 Jan 2008, David Miller wrote:
 
  From: Andi Kleen [EMAIL PROTECTED]
  Date: Tue, 8 Jan 2008 06:00:07 +0100
  
   On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
The vast majority of them are one, two, and three liners.
   
   % awk '  { line++ } ; /^{/ { total++; start = line } ; /^}/ { 
   len=line-start-3; if (len  4) l++; if (len = 10) k++; } ; END { print 
   total, l, l/total, k, k/total }'  include/net/tcp.h
   68 28 0.411765 20 0.294118
   
   41% are over 4 lines, 29% are = 10 lines.
  
  Take out the comments and whitespace lines, your script is
  too simplistic.

In addition it triggered spuriously per struct/enum end brace :-) and
was using the last known function starting brace in there so no wonder
the numbers were that high... Counting with the corrected lines 
(len=line-start-1)  spurious matches removed:

74 19 0.256757 7 0.0945946


Here are (finally) the measured bytes (couple of the functions are 
missing because I had couple of bugs in the regexps and the #if trickery 
at the inline resulted failed compiles):

 12 funcs, 242+, 1697-, diff: -1455  tcp_set_state 
 13 funcs, 92+, 632-, diff: -540 tcp_is_cwnd_limited 
 12 funcs, 2836+, 3225-, diff: -389  tcp_current_ssthresh 
 5 funcs, 261+, 556-, diff: -295 tcp_prequeue 
 7 funcs, 2777+, 3049-, diff: -272   tcp_clear_retrans_hints_partial 
 11 funcs, 64+, 275-, diff: -211 tcp_win_from_space 
 6 funcs, 128+, 320-, diff: -192 tcp_prequeue_init 
 12 funcs, 45+, 209-, diff: -164 tcp_set_ca_state 
 7 funcs, 106+, 237-, diff: -131 tcp_fast_path_check 
 5 funcs, 167+, 291-, diff: -124 tcp_write_queue_purge 
 6 funcs, 43+, 160-, diff: -117  tcp_push_pending_frames 
 9 funcs, 55+, 159-, diff: -104  tcp_v4_check 
 6 funcs, 4+, 97-, diff: -93 tcp_packets_in_flight 
 7 funcs, 58+, 150-, diff: -92   tcp_fast_path_on 
 4 funcs, 4+, 91-, diff: -87 tcp_clear_options 
 6 funcs, 141+, 217-, diff: -76  tcp_openreq_init 
 8 funcs, 38+, 111-, diff: -73   tcp_unlink_write_queue 
 7 funcs, 32+, 103-, diff: -71   tcp_checksum_complete 
 7 funcs, 35+, 101-, diff: -66   __tcp_fast_path_on 
 5 funcs, 4+, 66-, diff: -62 tcp_receive_window 
 6 funcs, 67+, 128-, diff: -61   tcp_add_write_queue_tail 
 7 funcs, 30+, 86-, diff: -56tcp_ca_event 
 6 funcs, 73+, 106-, diff: -33   tcp_paws_check 
 4 funcs, 4+, 36-, diff: -32 tcp_highest_sack_seq 
 6 funcs, 46+, 78-, diff: -32tcp_fin_time 
 3 funcs, 4+, 35-, diff: -31 tcp_clear_all_retrans_hints 
 7 funcs, 30+, 51-, diff: -21__tcp_add_write_queue_tail 
 3 funcs, 4+, 14-, diff: -10 tcp_enable_fack 
 4 funcs, 4+, 14-, diff: -10 keepalive_time_when 
 8 funcs, 66+, 73-, diff: -7 tcp_full_space 
 3 funcs, 4+, 5-, diff: -1   tcp_wnd_end 
 4 funcs, 97+, 97-, diff: +0 tcp_mib_init 
 3 funcs, 4+, 3-, diff: +1   tcp_skb_is_last 
 2 funcs, 4+, 2-, diff: +2   keepalive_intvl_when 
 2 funcs, 4+, 2-, diff: +2   tcp_is_fack 
 2 funcs, 4+, 2-, diff: +2   tcp_skb_mss 
 2 funcs, 4+, 2-, diff: +2   tcp_write_queue_empty 
 2 funcs, 4+, 2-, diff: +2   tcp_advance_highest_sack 
 2 funcs, 4+, 2-, diff: +2   tcp_advance_send_head 
 2 funcs, 4+, 2-, diff: +2   tcp_check_send_head 
 2 funcs, 4+, 2-, diff: +2   tcp_highest_sack_reset 
 2 funcs, 4+, 2-, diff: +2   tcp_init_send_head 
 2 funcs, 4+, 2-, diff: +2   tcp_sack_reset 
 6 funcs, 47+, 44-, diff: +3 tcp_space 
 5 funcs, 55+, 50-, diff: +5 tcp_too_many_orphans 
 3 funcs, 8+, 2-, diff: +6   tcp_minshall_update 
 3 funcs, 8+, 2-, diff: +6   tcp_update_wl 
 8 funcs, 25+, 14-, diff: +11between 
 3 funcs, 14+, 2-, diff: +12 tcp_put_md5sig_pool 
 3 funcs, 14+, 2-, diff: +12 tcp_clear_xmit_timers 
 5 funcs, 30+, 17-, diff: +13tcp_dec_pcount_approx_int 
 6 funcs, 33+, 20-, diff: +13tcp_insert_write_queue_after 
 3 funcs, 17+, 2-, diff: +15 __tcp_checksum_complete 
 5 funcs, 17+, 2-, diff: +15 tcp_init_wl 
 4 funcs, 57+, 41-, diff: +16tcp_dec_quickack_mode 
 4 funcs, 40+, 22-, diff: +18__tcp_add_write_queue_head 
 5 funcs, 36+, 16-, diff: +20tcp_highest_sack_combine 
 4 funcs, 40+, 18-, diff: +22tcp_dec_pcount_approx 
 6 funcs, 29+, 5-, diff: +24 tcp_is_sack 
 4 funcs, 28+, 2-, diff: +26 tcp_is_reno 
 5 funcs, 50+, 24-, diff: +26tcp_insert_write_queue_before 
 4 funcs, 83+, 56-, diff: +27tcp_check_probe_timer 
 8 funcs, 69+, 14-, diff: +55tcp_left_out 
 11 funcs, 2995+, 2893-, diff: +102  tcp_skb_pcount 
 30 funcs, 930+, 2-, diff: +928  before 

-- 
 i.

Re: [PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat

2008-01-09 Thread Ilpo Järvinen

On Sat, 5 Jan 2008, Arnaldo Carvalho de Melo wrote:

 Em Sat, Jan 05, 2008 at 03:39:04PM +0200, Ilpo Järvinen escreveu:
  Hi Dave,
  
  After Arnaldo got codiff's inline instrumentation bugs fixed
  (thanks! :-)), I got my .c-inline-bloat-o-meter to power up
  reliably after some tweaking and bug fixing on my behalf...
  It shows some very high readings every now and then in the
  code under net/.
 
 Thank you for the reports and for showing how these tools can be put to
 good use! 
 
 If you have any further suggestions on how to make codiff and the
 dwarves to be of more help or find any other bug, please let me know.

It could use a bit less memory because my header inline checking attempt 
on a machine with 2G+2G ends up like this:

$ codiff vmlinux.o.base vmlinux.o
libclasses: out of memory(inline_expansion__new)
$ ls -al vmlinux.o{,.base}
-rw-r- 1 ijjarvin tkol 633132586 Jan  9 13:11 vmlinux.o
-rw-r- 1 ijjarvin tkol 633132572 Jan  9 00:58 vmlinux.o.base
$

Considering that's only 0.6G+0.6G I've a problem in understanding why 
codiff's number crunching eats up so much memory.

I just hope there isn't any O(n^2) or worse algos in it either once
the memory consumption gets resolved. :-)


-- 
 i.

Re: SACK scoreboard

2008-01-09 Thread Ilpo Järvinen

On Tue, 8 Jan 2008, John Heffner wrote:

 Andi Kleen wrote:
  David Miller [EMAIL PROTECTED] writes:
   The big problem is that recovery from even a single packet loss in a
   window makes us run kfree_skb() for a all the packets in a full
   window's worth of data when recovery completes.
  
  Why exactly is it a problem to free them all at once? Are you worried
  about kernel preemption latencies?
 
 I also wonder how much of a problem this is (for now, with window sizes of
 order 1 packets.  My understanding is that the biggest problems arise from
 O(N^2) time for recovery because every ack was expensive. Have current 
 tests shown the final ack to be a major source of problems?

This thread got started because I tried to solve the other latencies but 
realized that it helps very little because this latency spike would 
have remained unsolved and it happens in one of the most common case.

-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-08 Thread Ilpo Järvinen

On Mon, 7 Jan 2008, David Miller wrote:

 From: Andi Kleen [EMAIL PROTECTED]
 Date: Tue, 8 Jan 2008 06:00:07 +0100

  On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
   The vast majority of them are one, two, and three liners.

  % awk '  { line++ } ; /^{/ { total++; start = line } ; /^}/ { 
  len=line-start-3; if (len  4) l++; if (len = 10) k++; } ; END { print 
  total, l, l/total, k, k/total }'  include/net/tcp.h
  68 28 0.411765 20 0.294118

  41% are over 4 lines, 29% are = 10 lines.

 Take out the comments and whitespace lines, your script is
 too simplistic.

He should also remove these, which involve just syntactic casting to 
please the compiler:

struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);

...and maybe some other similar ones. I'm sure that's going to make his 
statistics even more questionable as is because we need tp all around.

But about his concerns, I spend couple of minutes in looking to it by 
hand. These are likely to be a win:
tcp_set_state
tcp_prequeue 
tcp_prequeue_init
tcp_openreq_init

...about rest I'm very unsure but there are some that probably are a 
minor win as well.

tcp_dec_quickack_mode has only a single caller anyway so I'd doubt that 
it's going to be significant, I thought moving it to .c earlier but 
haven't just done that yet :-).

-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-08 Thread Ilpo Järvinen

On Tue, 8 Jan 2008, Andi Kleen wrote:

 David Miller [EMAIL PROTECTED] writes:
 
  Similarly I question just about any inline usage at all in *.c files
 
 Don't forget the .h files. Especially a lot of stuff in tcp.h should
 be probably in some .c file and not be inline.

I'm not forgetting them... :-)

My intention is to do the same for all headers as well but I'd really like 
to get some actual number of bytes by measuring rather than taking them of 
the hat.

...It just requires some work still to get the uninlining actually do the 
right thing and actually testing it involves orders of magnitude more 
exhaustive recompilation than .o only checking required so after finishing 
the script I either has to setup n distcc master with a number of slaves 
each or wait for some time for the results :-).

However, I suspect that anything in tcp.h won't match to something like 
~5-10k like I've seen in three .c files already + one with debugging that 
has even more than 10k (even if all in tcp.h summed up :-)).

-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SACK scoreboard

2008-01-08 Thread Ilpo Järvinen

On Mon, 7 Jan 2008, David Miller wrote:

 Did you happen to read a recent blog posting of mine?
 
   http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead
 
 I've been thinking more and more and I think we might be able
 to get away with enforcing that SACKs are always increasing in
 coverage.
 
 I doubt there are any real systems out there that drop out of order
 packets that are properly formed and are in window, even though the
 SACK specification (foolishly, in my opinion) allows this.

Luckily we can see that already from MIBs, so quering people who have 
large servers, which are continously testing the internet :-), under 
their supervision or can access, and asking if they see any might help.
I checked my dept's interactive servers and all had zero renegings, but
I don't think I have access to www server which would have much wider 
exposure.

 If we could free packets as SACK blocks cover them, all the problems
 go away.

I thought it a bit yesterday after reading your blog and came to 
conclusion that they won't, we can still get those nasty ACKs regardless 
of received SACK info (in here, missing). Even in some valid cases which 
include ACK losses besides actual data loss, not that this is the most 
common case but just wanted to point out that cleanup work is at least 
partially independent of SACK problem. So not all problems would go
away really.

 For one thing, this will allow the retransmit queue liberation during
 loss recovery to be spread out over the event, instead of batched up
 like crazy to the point where the cumulative ACK finally moves and
 releases an entire window's worth of data.

Two key cases for real pattern are:

1. Losses once per n, where n is something small, like 2-20 or so, usually
   happens at slow start overshoot or when compething traffic slow starts. 
   Cumulative ACKs will cover only small part of the window once rexmits 
   make through, thus this is not a problem.
2. Single loss (or few at the beginning of the window), rest SACKed. 
   Cumulative ACK will cover original window when the last necessary 
   rexmit gets through.

Case 1 becomes nasty ACKy only if rexmit is lost as well, but in that case 
the arriving SACK blocks make the rest of the window equal to 2 :-).

So I'm now trying to solve just case 2. What if we could somehow combine 
adjacent skbs (or whatever they're called in that model) if SACK covers 
them both so that we still hold them but can drop them in a very 
efficient way. That would make the combining effort split per ACK. 
And if reneging would occur, we can think a way to put the necessary fuzz 
into a form which cannot hurt the rest of the system (relatively easy  
fast if we add CA_Reneging and allow retransmitting a portion of an skb 
similar to what you suggested earlier).

And it might even be possible then to offer admin a control so that the 
admin can choose between recover/plain reset if admin thinks that it's 
always an indication of an attack. This is somewhat similar case to what 
UTO (under IETF evaluation) does, as purpose of both is in violation of 
RFC TCP to avoid malicious traps but the control about it is left to the 
user.

 Next, it would simplify all of this scanning code trying to figure out
 which holes to fill during recovery.
 
 And for SACK scoreboard marking, the RB trie would become very nearly
 unecessary as far as I can tell.

I've been contacted by a person who was interested in reaching 500k 
windows, so your 4000 sounded like a joke :-/. Having, let say, every
20th dropped means 25k skbs remaining, can we scan though it in any
sensible time without RBs and friends :-)? However, allowing queue walk
to begin from either direction would solve most of the common cases well 
enough for it to be nearly manageable.

 I would not even entertain this kind of crazy idea unless I thought
 the fundamental complexity simplification payback was enormous.  And
 in this case I think it is.
 
 What we could do is put some experimental hack in there for developers
 to start playing with, which would enforce that SACKs always increase
 in coverage.  If violated the connection reset and a verbose log
 message is logged so we can analyze any cases that occur.

We have an initial number already, in MIBs.

 Sounds crazy, but maybe has potential.  What do you think?

If I'd hint my boss that I'm involved in something like this I'd
bet that he also would get quite crazy... ;-) I'm partially paid
for making TCP more RFCish :-), or at least that the places where
thing diverge are known and controllable for research purposes.


-- 
 i.

ps. If other Cced would like to get dropped if there are some followups, 
just let me know :-). Else, no need to do anything.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-07 Thread Ilpo Järvinen

On Sun, 6 Jan 2008, Herbert Xu wrote:

 is definitely not a fast path.  If a function ends up being called
 just once the compiler will most likely inline it anyway, making the
 use of the keyword inline redundant.

Unexpected enough, even this logic seems to fail in a way with my gcc, I'm 
yet to study it closer but it seems to me that e.g., uninlining only once 
called tcp_fastretrans_alert is worth of at least 100 bytes (note that 
it's not inlined by us, gcc did it all by itself)! Otherwise I'd fail to 
understand why I got -270 bytes from uninlining tcp_moderate_cwnd which is 
only 57 bytes as unlined with three call sites.

net/ipv4/tcp_input.c:
  tcp_undo_cwr  |  -48
  tcp_try_undo_recovery |  -55
  tcp_ack   | -2941
 3 functions changed, 3044 bytes removed, diff: -3044

net/ipv4/tcp_input.c:
  tcp_moderate_cwnd |  +57
  tcp_fastretrans_alert | +2717
 2 functions changed, 2774 bytes added, diff: +2774

net/ipv4/tcp_input.o:
 5 functions changed, 2774 bytes added, 3044 bytes removed, diff: -270

I'll probably force uninlining of it without tcp_moderate_cwnd noise and 
try a number of gcc versions.


-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] [XFRM]: Kill some bloat

2008-01-06 Thread Ilpo Järvinen

On Sat, 5 Jan 2008, David Miller wrote:

 From: Herbert Xu [EMAIL PROTECTED]
 Date: Sun, 06 Jan 2008 11:29:35 +1100

  We should never use inline except when it's on the fast path and this
  is definitely not a fast path.  If a function ends up being called
  just once the compiler will most likely inline it anyway, making the
  use of the keyword inline redundant.

 Similarly I question just about any inline usage at all in *.c files
 these days.

 I even would discourage it's use for fast-path cases as well.

There still seems to be good candidates for inline in *.c files, in worst 
case I had +172 due to inline removed and ~60 are on +10-+90 range with 
my gcc, later gccs might do better but I definately would just blindly 
remove them all. Here's the other end of the list:

+35 static inline hci_encrypt_change_evtnet/bluetooth/hci_event.c
+36 static __inline__ tcp_in_window net/ipv4/tcp_minisocks.c
+38 static inline hci_conn_complete_evt net/bluetooth/hci_event.c
+38 static inline hci_conn_request_evt  net/bluetooth/hci_event.c
+42 static inline gred_wred_modenet/sched/sch_gred.c
+45 static inline secpath_has_nontransport  net/xfrm/xfrm_policy.c
+52 static inline bool port_match   net/netfilter/xt_tcpudp.c
+67 static inline dn_check_idf  net/decnet/dn_nsp_in.c
+90 static inline __ieee80211_queue_stopped net/mac80211/tx.c
+172 static inline sctp_chunk_length_valid  net/sctp/sm_statefuns.c

-- 
 i.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-2.6.25] [NET]: Remove obsolete comment

2008-01-05 Thread Ilpo Järvinen

[PATCH] [NET]: Remove obsolete comment

It seems that ip_build_xmit is no longer used in here and
ip_append_data is used.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---

Hi Dave,

While reading some nearby code, I noticed this. I'm not 100%
sure if the removal is valid or not nor have interest to dig
it up from history but I suppose you know it better without
having to look it up from history :-).

 net/ipv4/ip_output.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3dc0c12..e57de0f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1339,8 +1339,6 @@ static int ip_reply_glue_bits(void *dptr, char *to, int 
offset,
  *
  * Should run single threaded per socket because it uses the sock
  * structure to pass arguments.
- *
- * LATER: switch from ip_build_xmit to ip_append_*
  */
 void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg 
*arg,
   unsigned int len)
-- 
1.5.0.6

[PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat

2008-01-05 Thread Ilpo Järvinen

Hi Dave,

After Arnaldo got codiff's inline instrumentation bugs fixed
(thanks! :-)), I got my .c-inline-bloat-o-meter to power up
reliably after some tweaking and bug fixing on my behalf...
It shows some very high readings every now and then in the
code under net/.

...Aand... we've a sovereign winner, though it was only fifth
on a kernel wide list (arch/ excluded due to number of
reasons) :-/.

Hall of (unquestionable) fame (measured per inline, top 10 under
net/):
  -4496 ctnetlink_parse_tuplenetfilter/nf_conntrack_netlink.c
  -2165 ctnetlink_dump_tuplesnetfilter/nf_conntrack_netlink.c
  -2115 __ip_vs_get_out_rt   ipv4/ipvs/ip_vs_xmit.c
  -1924 xfrm_audit_helper_pktinfoxfrm/xfrm_state.c
  -1799 ctnetlink_parse_tuple_proto  netfilter/nf_conntrack_netlink.c
  -1268 ctnetlink_parse_tuple_ip netfilter/nf_conntrack_netlink.c
  -1093 ctnetlink_exp_dump_expectnetfilter/nf_conntrack_netlink.c
  -1060 ccid3_update_send_interval   dccp/ccids/ccid3.c
  -983  ctnetlink_dump_tuples_proto  netfilter/nf_conntrack_netlink.c
  -827  ctnetlink_exp_dump_tuple netfilter/nf_conntrack_netlink.c

Removing inlines is done iteratively because e.g., uninlining
ctnetlink_parse_tuple affected ..._{proto,ip} prices as well.

i386/gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)/allyesconfig
except CONFIG_FORCED_INLINING. Tried without some CONFIG.*DEBUG
as well and got slightly better numbers for some functions, yet
the number don't differ enough to be that meaningful, ie., if  
there's bloat somewhere, removing DEBUGs won't make it go away.

...After this first aid we're at least below 1k :-).

--
 i.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/4] [IPVS]: Kill some bloat

2008-01-05 Thread Ilpo Järvinen

From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= [EMAIL PROTECTED]

net/ipv4/ipvs/ip_vs_xmit.c:
  ip_vs_icmp_xmit   | -638
  ip_vs_tunnel_xmit | -674
  ip_vs_nat_xmit| -716
  ip_vs_dr_xmit | -682
 4 functions changed, 2710 bytes removed, diff: -2710

net/ipv4/ipvs/ip_vs_xmit.c:
  __ip_vs_get_out_rt | +595
 1 function changed, 595 bytes added, diff: +595

net/ipv4/ipvs/ip_vs_xmit.o:
 5 functions changed, 595 bytes added, 2710 bytes removed, diff: -2115

Without some CONFIG.*DEBUGs:

net/ipv4/ipvs/ip_vs_xmit.o:
 5 functions changed, 383 bytes added, 1513 bytes removed, diff: -1130

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/ipvs/ip_vs_xmit.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipvs/ip_vs_xmit.c b/net/ipv4/ipvs/ip_vs_xmit.c
index 1e96bf8..8436bf8 100644
--- a/net/ipv4/ipvs/ip_vs_xmit.c
+++ b/net/ipv4/ipvs/ip_vs_xmit.c
@@ -59,7 +59,7 @@ __ip_vs_dst_check(struct ip_vs_dest *dest, u32 rtos, u32 
cookie)
return dst;
 }
 
-static inline struct rtable *
+static struct rtable *
 __ip_vs_get_out_rt(struct ip_vs_conn *cp, u32 rtos)
 {
struct rtable *rt;  /* Route to the other host */
-- 
1.5.0.6

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 4 5 6 >

1 - 100 of 511 matches

Mail list logo