Re: Dropped connections with tcp_tw_recycle=1
Sven,

>> tcp_tw_recycle is incompatible with NAT on the server side ... because it will enforce the verification of TCP time stamps. Unless all clients behind a NAT (actually PAT/masquerading) device use identical timestamps (within a certain range), most of them will send invalid TCP timestamps so SYNs will get dropped.
>
> I've been digging a bit more. [...]

Thank you very much for your writeup regarding tcp_tw_recycle and timestamp verification. This is the part which I think I had already understood ...

> tcp_tw_recycle and _reuse's actual reuse of tw buckets seems to happen when setting up outbound connections. I haven't looked at those yet.

... but this is the part which I don't have a good understanding of yet.

> The outer conditional verifies that the incoming SYN has a timestamp, that tcp_tw_recycle is enabled, and that the origin exists in our peer cache. Note that it only checks the IP of the origin. Doesn't it make sense to also match on port?

My understanding is that the fact that the connection is in TIME_WAIT implies that the source port should not be reused at this time.

Nils

___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Dropped connections with tcp_tw_recycle=1
Nils Goroll wrote:
>> The outer conditional verifies that the incoming SYN has a timestamp, that tcp_tw_recycle is enabled, and that the origin exists in our peer cache. Note that it only checks the IP of the origin. Doesn't it make sense to also match on port?
>
> My understanding is that the fact that the connection is in TIME_WAIT implies that the source port should not be reused at this time.

Right, you're saying that the srcaddr+srcport pair of a connection in TIME_WAIT should not be reused under this scheme (i.e. the SYN can be dropped), and I agree. What I don't understand is why a new connection originating from a *different* source port (although from the same source IP) is also considered a dupe and dropped. SYN retries don't change or increase the source port, after all.

Is this a mistake in the TCP code, or maybe in my understanding of the issue?

Sven
Re: Dropped connections with tcp_tw_recycle=1
Sven,

> Right, you're saying that the srcaddr+srcport pair of a connection in TIME_WAIT should not be reused under this scheme (i.e. the SYN can be dropped), and I agree. Then I don't understand why a new connection originating from a *different* source port (although from the same source IP) is also considered a dupe and dropped.

Are you referring to this code?

    if (tmp_opt.saw_tstamp &&
        tcp_death_row.sysctl_tw_recycle &&
        (dst = inet_csk_route_req(sk, req)) != NULL &&
        (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
        peer->v4daddr == saddr) {
            if (xtime.tv_sec < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
                (s32)(peer->tcp_ts - req->ts_recent) > TCP_PAWS_WINDOW) {
                    NET_INC_STATS_BH(LINUX_MIB_PAWSPASSIVEREJECTED);
                    dst_release(dst);
                    goto drop_and_free;
            }
    }

Again, I cannot tell you what the intention of the implementors might have been, but my interpretation is that they wanted to implement time stamp checking as a side effect of tw_recycle (a positive one from the security standpoint).

I haven't thought about how (or if) the tw_recycle code could be improved, because I believe the benefits of TCP state reuse are overrated and the disadvantages outweigh the advantages. Also, my work focuses on OSes which don't have this issue ;-)

Thanks,
Nils
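The LINUX_MIB_PAWSPASSIVEREJECTED counter incremented in the code above is the one exported as PAWSPassive in /proc/net/netstat, which is how the drops can be confirmed on a live system. That file consists of line pairs: one line of column names followed by one line of values, both starting with the same prefix (for example "TcpExt:"). The following is only a minimal sketch of looking up one counter in such a pair; the function name netstat_counter is made up for illustration, and reading the two lines from the file is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in a names/values line pair as found in
 * /proc/net/netstat. Returns the value, or -1 if the name is absent.
 * (Hypothetical helper, not part of any library.)
 */
long netstat_counter(const char *names, const char *values, const char *key)
{
    assert(names != NULL && values != NULL && key != NULL);
    size_t keylen = strlen(key);

    for (;;) {
        /* Skip separators, then measure the next token on each line. */
        names  += strspn(names, " \n");
        values += strspn(values, " \n");
        size_t nlen = strcspn(names, " \n");
        size_t vlen = strcspn(values, " \n");

        if (nlen == 0 || vlen == 0)
            return -1;                      /* ran out of columns */
        if (nlen == keylen && memcmp(names, key, keylen) == 0)
            return strtol(values, NULL, 10);

        /* Advance both lines in lockstep to keep columns aligned. */
        names  += nlen;
        values += vlen;
    }
}
```

Calling netstat_counter(names_line, values_line, "PAWSPassive") before and after a test run and comparing the two results shows whether SYNs were rejected by this check in between.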
Re: Dropped connections with tcp_tw_recycle=1
Hi Michael and all,

>> tcp_tw_recycle is incompatible with NAT on the server side ... because it will enforce the verification of TCP time stamps. Unless all clients behind a NAT (actually PAT/masquerading) device use identical timestamps (within a certain range), most of them will send invalid TCP timestamps so SYNs will get dropped.
>
> Since you seem pretty knowledgeable on the subject, can you please explain the difference between tcp_tw_reuse and tcp_tw_recycle?

I think I have understood the reason why tcp_tw_recycle does not work with NAT connections, but I must say I haven't studied the Linux TCP implementation closely enough to explain the design decisions behind these two options.

The very basic idea is to re-use TCP connections in TIME_WAIT state, saving the overhead of destroying and recreating TCP state. I remember that at one point I thought I had understood the difference, but I can't recall it at the moment.

In short: I can tell you that you *must not* use tcp_tw_recycle for any machine talking to machines behind masquerading firewalls (in other words, only use it inside isolated networks). But I cannot tell you what exactly it is supposed to do and how it differs from tcp_tw_reuse.

If anyone finds out, please let me know as well!

Nils
Re: Dropped connections with tcp_tw_recycle=1
Nils Goroll wrote:
> tcp_tw_recycle is incompatible with NAT on the server side ... because it will enforce the verification of TCP time stamps. Unless all clients behind a NAT (actually PAT/masquerading) device use identical timestamps (within a certain range), most of them will send invalid TCP timestamps so SYNs will get dropped.

I've been digging a bit more. The drops happen because PAWS thinks they are old duplicate segments from earlier incarnations of the connection.

A new incoming connection request will eventually call tcp_ipv4.c:tcp_v4_conn_request(), where we find the following code that ends up dropping some SYNs if recycling is enabled:

    if (tmp_opt.saw_tstamp &&
        tcp_death_row.sysctl_tw_recycle &&
        (dst = inet_csk_route_req(sk, req)) != NULL &&
        (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
        peer->v4daddr == saddr) {
            if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
                (s32)(peer->tcp_ts - req->ts_recent) > TCP_PAWS_WINDOW) {
                    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
                    goto drop_and_release;
            }
    }

The outer conditional verifies that the incoming SYN has a timestamp, that tcp_tw_recycle is enabled, and that the origin exists in our peer cache. Note that it only checks the IP of the origin. Doesn't it make sense to also match on port?

The inner conditional tests two things: First, that the peer's last seen timestamp has not expired (it expires after 60 seconds [TCP_PAWS_MSL]). Next, that the new incoming timestamp [req->ts_recent] is more than one tick [TCP_PAWS_WINDOW] *before* the last seen timestamp from the peer [peer->tcp_ts] (i.e. that it's an old duplicate).

(Also, you can verify whether you get drops by checking the PAWSPassive value in /proc/net/netstat.)

Here's the origin of the code, appendix B.2 (b) in VJ et al.'s RFC 1323:

    An additional mechanism could be added to the TCP, a per-host
    cache of the last timestamp received from any connection
    [peer->tcp_ts]. This value [peer->tcp_ts] could then be used in
    the PAWS mechanism to reject old duplicate segments [req] from
    earlier incarnations of the connection, if the timestamp clock
    can be guaranteed to have ticked at least once [TCP_PAWS_WINDOW]
    since the old connection was open.

    -- http://tools.ietf.org/html/rfc1323#page-29

I'm wondering why the source port is not taken into consideration here. A previous incarnation of the connection would surely have the same source port? So if a new incoming connection has a different source port, it should not be a candidate for rejection.

tcp_tw_recycle and _reuse's actual reuse of tw buckets seems to happen when setting up outbound connections. I haven't looked at those yet.

Sven
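To make the effect of the check quoted above concrete, here is a small user-space model of it. This is only a sketch under assumptions, not the kernel code: struct peer_entry, syn_rejected() and nat_example() are names invented for this illustration, the cache holds a single entry, and the step where the kernel updates the cached timestamp is left out. The constants use the values mentioned above (60 seconds and one tick). Because the entry is keyed on the source IP alone, every host behind one NAT address is compared against the same cached timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCP_PAWS_MSL    60  /* seconds a cached per-host timestamp stays valid */
#define TCP_PAWS_WINDOW 1   /* tolerated backward step, in timestamp ticks */

/* Hypothetical stand-in for one entry of the per-IP peer cache. */
struct peer_entry {
    uint32_t addr;          /* source IP the entry is keyed on (no port) */
    uint32_t tcp_ts;        /* last timestamp value seen from this IP */
    int64_t  tcp_ts_stamp;  /* wall-clock second when tcp_ts was recorded */
};

/* Returns true when a SYN would be dropped by the recycle check. */
bool syn_rejected(const struct peer_entry *peer, uint32_t saddr,
                  uint32_t syn_tsval, int64_t now)
{
    if (peer == NULL || peer->addr != saddr)
        return false;           /* no cached state for this source IP */
    if (now >= peer->tcp_ts_stamp + TCP_PAWS_MSL)
        return false;           /* cached timestamp has expired */
    /* The signed 32-bit difference copes with timestamp wraparound. */
    return (int32_t)(peer->tcp_ts - syn_tsval) > TCP_PAWS_WINDOW;
}

/* Three hosts share one NAT address but have unrelated timestamp clocks. */
void nat_example(void)
{
    const uint32_t nat_ip = 0xC0000201u;    /* 192.0.2.1, documentation range */
    /* Host A closed a connection at t=1000 s; its last TSval was 500000. */
    struct peer_entry nat = { nat_ip, 500000u, 1000 };

    /* Host B booted recently, so its TSval is far behind A's: dropped. */
    assert(syn_rejected(&nat, nat_ip, 1000u, 1010));
    /* Host C has a larger TSval than A: accepted. */
    assert(!syn_rejected(&nat, nat_ip, 600000u, 1010));
    /* Host B again, 60 s after the entry was recorded: accepted. */
    assert(!syn_rejected(&nat, nat_ip, 1000u, 1060));
}
```

Note that the source port never appears in the model, which mirrors the observation above: a SYN from a different port is judged purely by the sender's IP and timestamp.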
Re: Dropped connections with tcp_tw_recycle=1
Hi Sven,

I don't know the precise basis for it, but I can vouch for the fact that tcp_tw_recycle is incompatible with NAT on the server side. I would guess it is because the NAT gateway keeps a connection tracking list and is unhappy that the webserver is trying to reuse the same ip:port hash whilst it is registered in TIME_WAIT mode.

There was a discussion of this previously:
http://projects.linpro.no/pipermail/varnish-misc/2009-April/002764.html

As you say, tw_reuse works OK with NAT.

Cheers,
Nick.

Sven Ulland wrote:
> [...]
Re: Dropped connections with tcp_tw_recycle=1
On Sep 20, 2009, at 6:20 AM, Nils Goroll wrote:
> tcp_tw_recycle is incompatible with NAT on the server side ... because it will enforce the verification of TCP time stamps. Unless all clients behind a NAT (actually PAT/masquerading) device use identical timestamps (within a certain range), most of them will send invalid TCP timestamps so SYNs will get dropped.

Since you seem pretty knowledgeable on the subject, can you please explain the difference between tcp_tw_reuse and tcp_tw_recycle?

Thanks,
--Michael
Dropped connections with tcp_tw_recycle=1
I was recently debugging an issue where several clients experienced sporadic problems connecting to a website cached by varnish. Every now and then, a TCP connection (say, something like every 20th to 50th) would time out, or sometimes take a few SYNs before being accepted.

Here's a typical example. It's observed at the spot marked 'X' in this network structure, from the client network's perspective:

    [clients] - [NAT gateway] - [bridge firewall]X - [Internet]

     0.00  natgw-extip  varni-extip  TCP  4292  http  [SYN]       TSV=283647429 TSER=0 WS=6
     2.99  natgw-extip  varni-extip  TCP  4292  http  [SYN]       TSV=283648179 TSER=0 WS=6
     8.99  natgw-extip  varni-extip  TCP  4292  http  [SYN]       TSV=283649679 TSER=0 WS=6
    20.99  natgw-extip  varni-extip  TCP  4292  http  [SYN]       TSV=283652679 TSER=0 WS=6
    44.99  natgw-extip  varni-extip  TCP  4292  http  [SYN]       TSV=283658679 TSER=0 WS=6
    93.00  natgw-extip  varni-extip  TCP  4292  http  [SYN]       TSV=283670679 TSER=0 WS=6
    93.00  varni-extip  natgw-extip  TCP  http  4292  [SYN, ACK]  TSV=2342207123 TSER=283670679

Note: The NAT gateway didn't do port translation here. Also, the timestamp values were not touched by the NAT gateway. The varnish node is behind LVS-TUN, but the LVS was not the culprit.

After troubleshooting with the website owner and tcpdumping at various points on both sides, it was clear that the packets were reaching the varnish node, but all except the last SYN were dropped. This turned out to be because the varnish node had the tcp_tw_recycle sysctl enabled. Switching it off fixed the problem.

The performance page on the varnish wiki has featured recommended Linux sysctl settings, including enabling tcp_tw_recycle, since April 2008. The recycle setting was removed from that page recently, but I would think there are a lot of installations around the world that have it enabled.

I tried to figure out exactly how the recycling mechanism works, but the code is too complex to figure out without time or kernel network experience. Recycling was introduced by David Miller in 2.3.15, ref <URL:http://lxr.linux.no/#linux-old+v2.3.15/net/ipv4/tcp_ipv4.c#L324> and e.g. <URL:http://lxr.linux.no/#linux+v2.6.31/net/ipv4/tcp_ipv4.c#L1255>.

Does anyone have a good grasp of how it works, its connection to the RFC 1323 PAWS mechanism, and its claimed incompatibility with NAT (ref <URL:http://lkml.org/lkml/2008/11/15/83>)?

When observing the same issue previously (dropped SYNs), I ditched tw_recycle in favour of tcp_tw_reuse, which doesn't seem to cause any problems (this was on a normal Apache system). It too is severely underdocumented, so I was hoping to shed some light on them both, and on the exact circumstances where they are suitable for use.

Sven
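As a quick arithmetic check on the trace above, the timestamps and TSV values of the six SYNs can be compared directly. The sketch below copies the numbers from the capture; the struct and function names are invented for this illustration. It computes how fast the sender's TSV advances and how the gaps between retransmitted SYNs grow. What those numbers say about the sender (for instance its timestamp clock rate) is an inference from the trace, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One SYN from the capture: seconds since the first SYN, and its TSV. */
struct syn_sample {
    double   t;
    uint32_t tsval;
};

/* The six SYNs from the trace, copied verbatim. */
const struct syn_sample trace[] = {
    {  0.00, 283647429u }, {  2.99, 283648179u }, {  8.99, 283649679u },
    { 20.99, 283652679u }, { 44.99, 283658679u }, { 93.00, 283670679u },
};
const size_t trace_len = sizeof trace / sizeof trace[0];

/* Average TSV ticks per second between the first and last sample. */
double ticks_per_second(const struct syn_sample *s, size_t n)
{
    assert(n >= 2 && s[n - 1].t > s[0].t);
    return (double)(s[n - 1].tsval - s[0].tsval) / (s[n - 1].t - s[0].t);
}

/* Gap in seconds between SYN i and the one before it (requires i >= 1). */
double retransmit_gap(const struct syn_sample *s, size_t i)
{
    assert(i >= 1);
    return s[i].t - s[i - 1].t;
}
```

With these numbers, the TSV advances by 23250 ticks over 93 seconds, i.e. about 250 ticks per second, and the gaps between SYNs are roughly 3, 6, 12, 24 and 48 seconds, each about double the previous one.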