Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Fri, 21 Dec 2007, David Miller wrote:

> When Gavin respins the patch I'll look at it in the context of submitting
> it as a bug fix. So Gavin please generate the patch against Linus's
> vanilla GIT tree or net-2.6, your choice.

The existing patch was against Linus' linux-2.6.git from a few days ago, so I've updated my tree and regenerated the patch (below). Is that the right one?

I'm just checking through the existing CA modules. I don't see the rtt used for RTO anywhere. This is what I gather they're each using rtt for:

tcp_highspeed.c   doesn't implement .pkts_acked
tcp_hybla.c       doesn't implement .pkts_acked
tcp_scalable.c    doesn't implement .pkts_acked
tcp_bic.c         ignores rtt value from .pkts_acked
tcp_lp.c          seems to ignore rtt value from .pkts_acked (despite setting TCP_CONG_RTT_STAMP for high res rtts -- why?)
tcp_vegas.c       uses high res rtt to measure congestion signal, increase, backoff -- TCP_CONG_RTT_STAMP set so doesn't use seq_rtt
tcp_veno.c        uses high res rtt to measure congestion signal, increase, backoff -- TCP_CONG_RTT_STAMP set so doesn't use seq_rtt
tcp_yeah.c        uses high res rtt to measure congestion signal, increase, backoff -- TCP_CONG_RTT_STAMP set so doesn't use seq_rtt
tcp_illinois.c    uses rtt to scale increase, backoff -- TCP_CONG_RTT_STAMP set so doesn't use seq_rtt
tcp_htcp.c        uses rtt to scale increase, backoff
tcp_cubic.c       uses rtt to scale increase, backoff
tcp_westwood.c    scales backoff using rtt

So as far as I can tell, timeout stuff is never altered using pkts_acked(), so I guess this fix only affects westwood, htcp and cubic just now.

I need to re-read properly, but I think the same problem affects the microsecond values where TCP_CONG_RTT_STAMP is set (used by vegas, veno, yeah, illinois). I might follow up with another patch which changes the behaviour where TCP_CONG_RTT_STAMP is set, when I'm more sure of that.
Thanks,
Gavin

Signed-off-by: Gavin McCullagh [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 889c893..6fb7989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2651,6 +2651,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 	u32 cnt = 0;
 	u32 reord = tp->packets_out;
 	s32 seq_rtt = -1;
+	s32 ca_seq_rtt = -1;
 	ktime_t last_ackt = net_invalid_timestamp();
 
 	while ((skb = tcp_write_queue_head(sk)) && skb != tcp_send_head(sk)) {
@@ -2686,13 +2687,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 				if (sacked & TCPCB_SACKED_RETRANS)
 					tp->retrans_out -= packets_acked;
 				flag |= FLAG_RETRANS_DATA_ACKED;
+				ca_seq_rtt = -1;
 				seq_rtt = -1;
 				if ((flag & FLAG_DATA_ACKED) ||
 				    (packets_acked > 1))
 					flag |= FLAG_NONHEAD_RETRANS_ACKED;
 			} else {
+				ca_seq_rtt = now - scb->when;
 				if (seq_rtt < 0) {
-					seq_rtt = now - scb->when;
+					seq_rtt = ca_seq_rtt;
 					if (fully_acked)
 						last_ackt = skb->tstamp;
 				}
@@ -2709,8 +2712,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 			    !before(end_seq, tp->snd_up))
 				tp->urg_mode = 0;
 		} else {
+			ca_seq_rtt = now - scb->when;
 			if (seq_rtt < 0) {
-				seq_rtt = now - scb->when;
+				seq_rtt = ca_seq_rtt;
 				if (fully_acked)
 					last_ackt = skb->tstamp;
 			}
@@ -2772,8 +2776,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 						 net_invalid_timestamp()))
 				rtt_us = ktime_us_delta(ktime_get_real(),
 							last_ackt);
-			else if (seq_rtt > 0)
-				rtt_us = jiffies_to_usecs(seq_rtt);
+			else if (ca_seq_rtt > 0)
+				rtt_us = jiffies_to_usecs(ca_seq_rtt);
 		}
 
 		ca_ops->pkts_acked(sk, pkts_acked, rtt_us);
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
Hi,

On Tue, 18 Dec 2007, Gavin McCullagh wrote:

> The last attempt didn't take account of the situation where a timestamp
> wasn't available and tcp_clean_rtx_queue() has to feed both the RTO and
> the congestion avoidance. This updated patch stores both RTTs, making the
> delayed one available for the RTO and the other (ca_seq_rtt) available for
> congestion control.

I forgot to include some data to show the difference this can make to the RTT signal:

http://www.hamilton.ie/gavinmc/linux/tcp_clean_rtx_queue.html

Gavin
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
Hi,

On Wed, 19 Dec 2007, Ilpo Järvinen wrote:

> Isn't it also much better this way in a case where ACK losses happened?
> Taking the longest RTT in that case is clearly questionable as it may
> over-estimate considerably.

Quite so.

> However, another thing to consider is the possibility of this value being
> used in timeout-like fashion in ca modules (I haven't read enough ca
> modules code to know if any of them does that), as opposed to determining
> just rtt or packet's delay, in which case this change seems appropriate
> (most modules do the latter).

I'm not aware of any, but I haven't read them all either. I would have thought tp->srtt was the value to use in this instance, but perhaps the individual timestamps including delack delay are useful.

> Therefore, if a timeout-like module exists one should also add
> TCP_CONG_RTT_STAMP_LONGEST for that particular module and keep using
> seq_rtt for it like previously and use ca_seq_rtt only for others.

Seems reasonable. I'll add this.

> This part doesn't exist anymore in the development tree. Please base this
> patch (and anything in future) you intend to get included to mainline onto
> net-2.6.25 unless there's a very good reason to not do so, or whatever
> 2.6.xx is the correct net development tree at that time (if one exists).
> Thanks.

Will do. I gather I should use the latest net- tree in future when submitting patches.

Thanks for the helpful comments,
Gavin
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
The last attempt didn't take account of the situation where a timestamp wasn't available and tcp_clean_rtx_queue() has to feed both the RTO and the congestion avoidance. This updated patch stores both RTTs, making the delayed one available for the RTO and the other (ca_seq_rtt) available for congestion control.

Signed-off-by: Gavin McCullagh [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 889c893..6fb7989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2651,6 +2651,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 	u32 cnt = 0;
 	u32 reord = tp->packets_out;
 	s32 seq_rtt = -1;
+	s32 ca_seq_rtt = -1;
 	ktime_t last_ackt = net_invalid_timestamp();
 
 	while ((skb = tcp_write_queue_head(sk)) && skb != tcp_send_head(sk)) {
@@ -2686,13 +2687,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 				if (sacked & TCPCB_SACKED_RETRANS)
 					tp->retrans_out -= packets_acked;
 				flag |= FLAG_RETRANS_DATA_ACKED;
+				ca_seq_rtt = -1;
 				seq_rtt = -1;
 				if ((flag & FLAG_DATA_ACKED) ||
 				    (packets_acked > 1))
 					flag |= FLAG_NONHEAD_RETRANS_ACKED;
 			} else {
+				ca_seq_rtt = now - scb->when;
 				if (seq_rtt < 0) {
-					seq_rtt = now - scb->when;
+					seq_rtt = ca_seq_rtt;
 					if (fully_acked)
 						last_ackt = skb->tstamp;
 				}
@@ -2709,8 +2712,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 			    !before(end_seq, tp->snd_up))
 				tp->urg_mode = 0;
 		} else {
+			ca_seq_rtt = now - scb->when;
 			if (seq_rtt < 0) {
-				seq_rtt = now - scb->when;
+				seq_rtt = ca_seq_rtt;
 				if (fully_acked)
 					last_ackt = skb->tstamp;
 			}
@@ -2772,8 +2776,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
 						 net_invalid_timestamp()))
 				rtt_us = ktime_us_delta(ktime_get_real(),
 							last_ackt);
-			else if (seq_rtt > 0)
-				rtt_us = jiffies_to_usecs(seq_rtt);
+			else if (ca_seq_rtt > 0)
+				rtt_us = jiffies_to_usecs(ca_seq_rtt);
 		}
 
 		ca_ops->pkts_acked(sk, pkts_acked, rtt_us);
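The two-variable bookkeeping described in this message can be tried outside the kernel. What follows is a minimal user-space sketch, not kernel code: struct acked_seg, struct rtt_pair and collect_rtts() are hypothetical names invented for this illustration, and times are plain integer ticks. It walks the segments covered by one cumulative ACK in sequence order, keeps the first clean sample for the RTO, lets every later clean segment overwrite the congestion-control sample, and resets both when a retransmitted segment is seen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one segment newly covered by a cumulative ACK. */
struct acked_seg {
	uint32_t sent_at;	/* tick at which the segment was transmitted */
	int retransmitted;	/* non-zero if it was ever retransmitted */
};

/* Hypothetical result; a sample of -1 means "no sample taken". */
struct rtt_pair {
	int32_t seq_rtt;	/* first clean segment's RTT (for the RTO) */
	int32_t ca_seq_rtt;	/* latest clean segment's RTT (for CA) */
	int retrans_acked;	/* set if any acked segment was a retransmission */
};

static struct rtt_pair collect_rtts(const struct acked_seg *segs, size_t n,
				    uint32_t now)
{
	struct rtt_pair r = { -1, -1, 0 };
	size_t i;

	assert(segs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (segs[i].retransmitted) {
			/* Ambiguous sample: forget what was collected so far. */
			r.retrans_acked = 1;
			r.seq_rtt = -1;
			r.ca_seq_rtt = -1;
			continue;
		}
		/* Every clean segment overwrites the CA sample... */
		r.ca_seq_rtt = (int32_t)(now - segs[i].sent_at);
		/* ...but only the first clean one is kept for the RTO. */
		if (r.seq_rtt < 0)
			r.seq_rtt = r.ca_seq_rtt;
	}
	return r;
}
```

With two segments sent at ticks 100 and 140 and the ACK processed at tick 150, collect_rtts() returns seq_rtt = 50 and ca_seq_rtt = 10. A caller would be expected to ignore both samples whenever retrans_acked is set, in the same spirit as the FLAG_RETRANS_DATA_ACKED check in the patch.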
[PATCH/RFC] TCP: use non-delayed ACK for congestion control RTT
When a delayed ACK representing two packets arrives, there are two RTT samples available, one for each packet. The first (in order of seq number) will be artificially long due to the delay waiting for the second packet; the second will trigger the ACK and so will not itself be delayed.

According to RFC 1323, the SRTT used for RTO calculation should use the first rtt, so receivers echo the timestamp from the first packet in the delayed ack. For congestion control, however, it seems measuring delayed ack delay is not desirable as it varies independently of congestion.

The patch below causes seq_rtt to be updated with any available later packet rtts, which should have less (and hopefully zero) delack delay. The lower seq_rtt then gets passed to ca_ops->pkts_acked().

For non-delay based congestion control (cubic, h-tcp), rtt is sometimes used for rtt-scaling. In shortening the RTT, this may make them a little less aggressive. Delay-based schemes (e.g. vegas, illinois) should get a considerably cleaner, more accurate congestion signal, particularly for small cwnds. The congestion control module can potentially also filter out bad RTTs due to the delayed ack alarm by looking at the associated cnt which (where delayed acking is in use) should probably be 1 if the alarm went off, or greater if the ACK was triggered by a packet.

I seem to be undoing a design decision here, so perhaps there is some reason this should not be done? Comments/explanations appreciated...
Signed-off-by: Gavin McCullagh [EMAIL PROTECTED]

--- a/net/ipv4/tcp_input.c 2007-12-15 00:22:23.0 +
+++ b/net/ipv4/tcp_input.c 2007-12-17 13:35:16.0 +
@@ -2691,11 +2691,9 @@ static int tcp_clean_rtx_queue(struct so
 				    (packets_acked > 1))
 					flag |= FLAG_NONHEAD_RETRANS_ACKED;
 			} else {
-				if (seq_rtt < 0) {
-					seq_rtt = now - scb->when;
-					if (fully_acked)
-						last_ackt = skb->tstamp;
-				}
+				seq_rtt = now - scb->when;
+				if (fully_acked)
+					last_ackt = skb->tstamp;
 				if (!(sacked & TCPCB_SACKED_ACKED))
 					reord = min(cnt, reord);
 			}
@@ -2709,11 +2707,9 @@ static int tcp_clean_rtx_queue(struct so
 			    !before(end_seq, tp->snd_up))
 				tp->urg_mode = 0;
 		} else {
-			if (seq_rtt < 0) {
-				seq_rtt = now - scb->when;
-				if (fully_acked)
-					last_ackt = skb->tstamp;
-			}
+			seq_rtt = now - scb->when;
+			if (fully_acked)
+				last_ackt = skb->tstamp;
 			reord = min(cnt, reord);
 		}
 		tp->packets_out -= packets_acked;
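The cnt-based filtering idea raised in this message could look roughly like the sketch below. It is only an illustration under assumptions: rtt_sample_usable() and latest_clean_rtt() are hypothetical helpers, not existing kernel functions, and the rule "one packet acked means the delayed-ACK alarm probably fired" is the heuristic suggested in the message, which only makes sense where the receiver is actually delaying ACKs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether an RTT sample is likely to be free of
 * delayed-ACK timer delay. pkts_acked is the number of packets the ACK
 * covered; rtt_us is the sample in microseconds (<= 0 means "no sample").
 */
static int rtt_sample_usable(uint32_t pkts_acked, int32_t rtt_us)
{
	if (rtt_us <= 0)
		return 0;	/* no valid measurement */
	/* One packet acked: the delack alarm may have fired, so skip it. */
	return pkts_acked > 1;
}

/*
 * Hypothetical consumer: remember the most recent sample that passed the
 * filter, and keep the previous value otherwise.
 */
static int32_t latest_clean_rtt(int32_t prev_rtt_us, uint32_t pkts_acked,
				int32_t rtt_us)
{
	assert(prev_rtt_us >= 0);	/* 0 means "nothing recorded yet" */
	if (!rtt_sample_usable(pkts_acked, rtt_us))
		return prev_rtt_us;
	return rtt_us;
}
```

A module following this heuristic would discard every sample from a sender that only ever has one packet in flight, so it trades coverage for cleanliness; whether that is acceptable depends on the module.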
Re: reading the tcp headers within the write queue
Hi,

thanks for the swift reply.

On Thu, 13 Dec 2007, David Miller wrote:

> > I'm trying to hack together something which will run through the
> > retransmit queue looking at the tcp headers.
>
> The packets in the retransmit queue are headerless, the header only gets
> added to clones of the retransmit queue frames during the actual transmit.

Thought that might be it. I presume there isn't any other residue of the tcp options elsewhere that one could look at when the packet gets acknowledged? I'm particularly interested in the timestamp.

> And this question belongs on netdev not linux-net.

Oops, sorry.

Gavin
possible bug in tcp_probe
Hi,

I'm using linux v2.6.22.6 and tcp_probe with a couple of small modifications[1]. Even with moderately large numbers of flows (16 on the one machine), and increasingly as I monitor more flows than that, I get strange overflow problems such as this one:

74.259589763 192.168.2.1 36988 192.168.3.5 5001 0x679c23dc 0x679bc3b4 18 13 9114624 78 76 1 0 64
74.260590660 192.168.2.1 44261 192.168.3.5 5006 0x573bb3ed 0x573b700d 13 9 5254144 155 127 1 0 64
74.261607478 192.168.2.1 44261 192.168.3.5 5006 0x588.066586741 192.168.2.1 33739 192.168.3.5 5009 0xe26d1767 0xe26cf577 2 3 13090816 443 15818 1 0 64
88.066690797 192.168.2.1 33739 192.168.3.5 5009 0xe26d1767 0xe26cfb1f 3 3 13092864 2365 15818 1 0 64
88.067625714 192.168.2.1 59385 192.168.3.5 5012 0x411c1090 0x411bd258 12 9 14578688 2807 15812 1 0 64

As you can see, the third line has been truncated, along with roughly the next 14 seconds of data, after which data continues writing as usual.

I don't think my small changes are causing this but perhaps I'm wrong. Does anyone know what might be causing the above?

Many thanks for any ideas,
Gavin

[1] I have slightly modified tcp_probe to print out information for a range of ports (instead of one port or all) and to print info from the congestion avoidance inet_csk_ca struct. This adds a couple of extra fields to the end. If either of these is of interest as a patch I'll happily submit it.
Re: possible bug in tcp_probe
Hi Sangtae,

On Tue, 13 Nov 2007, SANGTAE HA wrote:

> This is fixed in the current version of tcp_probe by Stephen. Please see
> the below. You can copy the current version of tcp_probe to your kernel
> version and it should work.

Many thanks, I'll give this a try,
Gavin
Re: [PATCH] bcm43xx: Fix code for spec changes of 2/7/2007
Hi,

On Sat, 10 Feb 2007, Matthew Garrett wrote:

> On Sat, Feb 10, 2007 at 06:55:50AM +0100, Michael Buesch wrote:
> > It's likely that old cards still work with v4 firmware, but we don't
> > know and it has to be tested. Care to do so?
>
> I'll check the revision of my 4306, but I think it's probably too new to
> be useful, unfortunately...

I have a fairly old 4306 at home which I was lent by my boss, and I've been struggling to get it working for the past week or two. What I find is that lots of the firmwares I try with bcm43xx-fwcutter produce a complaint about the firmware being old, but when I get a firmware which is not too old the device registers but doesn't seem to work -- iwlist can't see any access points, and when I set the essid it doesn't associate with the AP.

Can I be helpful as a tester?

Gavin
Re: fixing opt-ack DoS against TCP stack
Hi,

[ moving this to netdev as requested ]

On Tue, 09 Jan 2007, Stephen Hemminger wrote:

> Actually, this paper seems to be a zombified version of:
> http://www.cs.ucsd.edu/~savage/papers/CCR99.pdf

Thanks. In fairness to them, the emphasis is slightly different: Savage et al are more interested in improving a receiver's performance, whereas Sherwood seems more interested in a DoS attack. However...

> It is not clear that current Linux systems are prone to the attack for a
> couple of reasons. First, Linux counts packets, not bytes, so extra ACKs
> would be ignored. Turning on ABC would also help.

The issue I'm raising is not so much how you could artificially increase cwnd by dividing ACKs or sending them early. The issue as I see it is that if the receiver doesn't send dup acks, the sender never backs off and may eventually flood its own link. As the receiver need only ACK the odd packet, the amplification can be substantial.

As this issue is already more or less solved (in the research sense), I'm not keen to spend a lot of time writing it up. So, here's a quick throw-together of a small example experiment I ran yesterday:

http://www.hamilton.ie/gavinmc/drop_dupack_attack/

> Lastly, the patch looks like it could cause more problems. It probably
> would break some application and other non-attacking TCP stacks. For this
> case, IMHO we need to wait for more research. If you want to pursue the
> problem, it needs to go through the RFC process.

I must admit I didn't read the patch in detail. As I understand it, the fix should (in principle) be compatible with other TCP stacks, which should just see an odd extra dropped packet and react with a duplicate ack as usual.

Gavin
fixing opt-ack DoS against TCP stack
Hi,

recently, a few of us came up with a novel (or so we thought) DoS attack against TCP. We spent some time implementing and testing it and found it to work worryingly well.

It turns out that we are not the first to come across this attack. Rob Sherwood and colleagues in Maryland were a year or two ahead of us. They have published a paper entitled "Misbehaving TCP Receivers Can Cause Internet-Wide Congestion Collapse".

http://www.cs.umd.edu/~capveg/optack/optack-ccs05.pdf
http://www.cs.umd.edu/~capveg/
http://www.kb.cert.org/vuls/id/102014

Linux appears not to have implemented any fix for this vulnerability, although Rob Sherwood wrote a patch against 2.4.24.

http://www.cs.umd.edu/~capveg/optack/optack.patch

There seems to be a brief mention of it on the fedora-security list, but I can't find much discussion of it in linux circles otherwise.

http://www.spinics.net/linux/fedora/fedora-security/msg00426.html
http://www.securityfocus.com/bid/15468/

Is there some reason that this fix was not accepted, or has this just slipped under people's radars? Should some fix not be implemented? The issue seems even more severe with the larger buffer sizes now in use.

Gavin
Re: [PATCH] fix integer overflow in H-TCP congestion control
When using H-TCP with a single flow on a 500Mbit connection (or less actually), alpha can exceed 65000, so alpha needs to be a u32.

Signed-off-by: Gavin McCullagh [EMAIL PROTECTED]
Signed-off-by: Doug Leith [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp_htcp.c
index 6edfe5e..8072b6d 100644
--- a/net/ipv4/tcp_htcp.c
+++ b/net/ipv4/tcp_htcp.c
@@ -23,7 +23,7 @@ module_param(use_bandwidth_switch, int,
 MODULE_PARM_DESC(use_bandwidth_switch, "turn on/off bandwidth switcher");
 
 struct htcp {
-	u16	alpha;		/* Fixed point arith, << 7 */
+	u32	alpha;		/* Fixed point arith, << 7 */
 	u8	beta;		/* Fixed point arith, << 7 */
 	u8	modeswitch;	/* Delay modeswitch until we had at least one congestion event */
 	u32	last_cong;	/* Time since last congestion event end */
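To see why a 16-bit field is too small here, a minimal user-space sketch follows. It is not the kernel code: it assumes the "7" in the struct comment means 7 fractional bits (so 1.0 is stored as 128), and to_fixed7(), store_u16() and store_u32() are hypothetical names used only for this illustration. A 16-bit field tops out at 65535, so any larger fixed-point value silently loses its high bits on assignment, while a 32-bit field keeps it.

```c
#include <assert.h>
#include <stdint.h>

#define ALPHA_SHIFT 7	/* assumed: 7 fractional bits, so 1.0 == 128 */

/* Convert a whole-number alpha to the assumed fixed-point representation. */
static uint32_t to_fixed7(uint32_t alpha_whole)
{
	/* Guard against overflowing 32 bits in the shift itself. */
	assert(alpha_whole <= (UINT32_MAX >> ALPHA_SHIFT));
	return alpha_whole << ALPHA_SHIFT;
}

/* Storing into 16 bits keeps only the low 16 bits of the value. */
static uint16_t store_u16(uint32_t fixed)
{
	return (uint16_t)fixed;
}

/* Storing into 32 bits preserves the value. */
static uint32_t store_u32(uint32_t fixed)
{
	return fixed;
}
```

Taking an illustrative alpha of 600 (chosen for the arithmetic, not measured from H-TCP), the fixed-point value is 76800. Stored in 16 bits it becomes 76800 - 65536 = 11264, which reads back as an alpha of 88; stored in 32 bits it stays 76800.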