Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Tue, Jun 24, 2014 at 10:39:46PM +1000, Darren Reed wrote:
> So the problem is this:
> * sshd tries to write to the socket, gets ENETUNREACH and then exits,
>   leading to the FIN packets being transmitted as the socket is closed
>   down in the normal course of things, but by the time it is doing the
>   exit, the network path has been restored.

Right. Here is the (slightly redacted) output of ktruss:

    http://smokva.net/pcap/crane-trace-sshd.txt

> You've got public IP addresses in your capture file and you've made no
> mention of using NAT, so I'm going to assume that the box with sshd/ssh
> on it is connected to the Internet directly with some kind of cable
> modem or similar.

The client connects through a FreeBSD router where an older pf version
is doing filtering, NAT and ALTQ (priq). The server has a public IP.

> Are you able to cross-check the events from sshd with log data from
> those devices?

The most cross-checking I've done is comparing the two pcaps. There is
not much to cross-check otherwise, since nothing really complains about
anything. Also, the ssh tunnels seem to work fine with ipf disabled. I'm
running a third 3GB job today with ipf disabled. Without disabling ipf,
the job wouldn't survive longer than 10-30 seconds.
> How about these two for me:
>
>     netstat -s | grep -i unreach
>     netstat -s | grep -i route

Sure:

# netstat -s | head -n20
icmp:
        413 calls to icmp_error
        0 errors not generated because old message was icmp
        Output histogram:
                echoreply: 100
                unreach: 413
                photuris: 6
        4 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        0 multicast echo requests ignored
        0 multicast timestamp requests ignored
        Input histogram:
                echoreply: 6
                unreach: 731
                echo: 100
                timxceed: 102
        100 message responses generated
        0 path MTU changes
# netstat -s | grep -i unreach
                unreach: 413
                unreach: 731
        0 dropped due to ICMP unreachable
        0 address unreachable
        0 port unreachable
        0 dropped due to ICMP unreachable
        0 packets rcvd for unreachable dest
# netstat -s | grep -i route
        0 SYNs dropped (no route or no space)
        0 output packets discarded due to no route
        0 no route available (output)
        17313 output packets discarded due to no route
        0 no route
        0 bad router solicitation messages
        0 bad router advertisement messages
        0 router advertisement routes dropped
        0 SYNs dropped (no route or no space)
        0 no route available (output)

> And of course the other important thing to do in an experiment is to
> save the output of netstat -s at the start of a run and compare that
> with its output when the problem has been seen again.
Here is the diff of two files I created before and after a failed job:

# ls -la netstat.*
-rw-r--r--  1 root  wheel  16332 Jun 25 09:58 netstat.a
-rw-r--r--  1 root  wheel  16332 Jun 25 09:59 netstat.b
# diff -u netstat.a netstat.b
--- netstat.a   2014-06-25 09:58:29.0 +0200
+++ netstat.b   2014-06-25 09:59:16.0 +0200
@@ -29,61 +29,61 @@
        0 membership reports received for groups to which we belong
        0 membership reports sent
 tcp:
-       36143366 packets sent
-       6650618 data packets (13929288006 bytes)
+       36178462 packets sent
+       6684489 data packets (14022704288 bytes)
        4682 data packets (4775110 bytes) retransmitted
-       17455512 ack-only packets (27272960 delayed)
+       17456129 ack-only packets (27274542 delayed)
        0 URG only packets
        0 window probe packets
-       12029336 window update packets
-       8939 control packets
+       12029943 window update packets
+       8947 control packets
        0 send attempts resulted in self-quench
-       48058930 packets received
-       3572420 acks (for 13901480535 bytes)
-       50912 duplicate acks
+       48083472 packets received
+       3589267 acks (for 13994738366 bytes)
+       51243 duplicate acks
        5 acks for unsent data
-       42343900 packets (64839052960 bytes) received in-sequence
-       13840 completely duplicate packets (852212 bytes)
+       42346063 packets (64885718854 bytes) received in-sequence
+       13866 completely duplicate packets (852212 bytes)
        1651 old duplicate packets
        781 packets with some dup. data (277997 bytes duped)
-       955822 out-of-order packets (1113010976 bytes)
+       955823 out-of-order packets (1113010976 bytes)
        0 packets (0 bytes) of data after window
        0
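Darren's before/after suggestion can be scripted so that only the
counters that actually moved show up. A minimal sketch; the filenames
and the unreach/route keywords follow the thread, the grep pipeline is
my own assumption:

```shell
# Snapshot kernel counters before and after a transfer attempt, then
# show only the changed lines that mention unreachables or routing.
netstat -s > netstat.a            # before the scheduled job
# ... run the tunnel / 3GB transfer here ...
netstat -s > netstat.b            # after the failure
# '^[+-][^+-]' keeps changed diff lines but drops the ---/+++ headers
diff -u netstat.a netstat.b | grep -E '^[+-][^+-]' | grep -iE 'unreach|route'
```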
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Tue, Jun 24, 2014 at 11:39:47PM +1000, Darren Reed wrote:
> Oh, I forgot, there are internal code paths in ipfilter/npf that can
> return ENETUNREACH.
>
> If you are using NetBSD 6 with ipfilter, comparing the output of this:
>
>     ipfstat | grep 'block reason'
>
> from before and after might be illuminating. Or maybe just compare the
> entire output of ipfstat and ipfstat -s from before and after.

No problem, ipfstat before and after a failed tunnel (reproducing it
now is very easy):

# ls -la
total 12
drwxrwxrwt   2 root  wheel  512 Jun 25 10:10 .
drwxr-xr-x  19 root  wheel  512 Jun 20 20:51 ..
-rw-r--r--   1 root  wheel  535 Jun 25 10:09 ipfstat-s.1403683750
-rw-r--r--   1 root  wheel  535 Jun 25 10:10 ipfstat-s.1403683819
-rw-r--r--   1 root  wheel  805 Jun 25 10:09 ipfstat.1403683750
-rw-r--r--   1 root  wheel  806 Jun 25 10:10 ipfstat.1403683819
# diff -u ipfstat-s.1403683750 ipfstat-s.1403683819
--- ipfstat-s.1403683750        2014-06-25 10:09:10.0 +0200
+++ ipfstat-s.1403683819        2014-06-25 10:10:19.0 +0200
@@ -1,27 +1,27 @@
 IP states added:
-       17761 TCP
-       58310 UDP
+       17772 TCP
+       58329 UDP
        92 ICMP
-       76918765 hits
-       439924 misses
+       77026414 hits
+       450385 misses
        0 bucket full
        0 maximum rule references
        0 maximum
        0 no memory
-       14 bkts in use
-       14 active
-       58402 expired
-       17747 closed
+       18 bkts in use
+       18 active
+       58418 expired
+       17757 closed
 State logging enabled
 State table bucket statistics:
-       14 in use
+       18 in use
        100% hash efficiency
-       0.24% bucket usage
+       0.31% bucket usage
        0 minimal length
        1 maximal length
        1.000 average length
 TCP Entries per state
      0    1    2    3    4    5    6    7    8    9   10   11
-     0    0    0    0    3    0    0    0    0    0    8    3
+     0    0    0    0    3    0    0    0    0    0    8    4
# diff -u ipfstat.1403683750 ipfstat.1403683819
--- ipfstat.1403683750  2014-06-25 10:09:10.0 +0200
+++ ipfstat.1403683819  2014-06-25 10:10:19.0 +0200
@@ -1,22 +1,22 @@
 bad packets:           in 0    out 0
- IPv6 packets:         in 0 out 5153
- input packets:        blocked 53 passed 44336 nomatch 0 counted 0 short 0
-output packets:        blocked 5218 passed 60118 nomatch 0 counted 0 short 0
+ IPv6 packets:         in 0 out 5155
+ input packets:        blocked 53 passed 92750 nomatch 0 counted 0 short 0
+output packets:        blocked 5239 passed 129793 nomatch 0 counted 0 short 0
 input packets logged:  blocked 0 passed 0
-output packets logged: blocked 65 passed 0
+output packets logged: blocked 84 passed 0
 packets logged:        input 0 output 0
- log failures:         input 0 output 13
+ log failures:         input 0 output 32
 fragment state(in):    kept 0  lost 0 not fragmented 0
 fragment state(out):   kept 0  lost 0 not fragmented 0
-packet state(in):      kept 1601  lost 0
+packet state(in):      kept 1631  lost 0
 packet state(out):     kept 27 lost 0
 ICMP replies:  2       TCP RSTs sent:  50
 Invalid source(in):    0
-Result cache hits(in): 1935    (out): 6778
+Result cache hits(in): 4644    (out): 9491
 IN Pullups succeeded:  0       failed: 0
 OUT Pullups succeeded: 0       failed: 0
 Fastroute successes:   3538    failures: 0
 TCP cksum fails(in):   0       (out): 0
-IPF Ticks:     781113
+IPF Ticks:     781251
 Packet log flags set: (0)
        none
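The epoch-stamped files above can be produced with a tiny helper. This
is only a sketch of the scheme used here (ipfstat requires root and
ipfilter, hence the command -v guard):

```shell
# Dump both ipfstat views into files named after the current epoch,
# matching the ipfstat.<epoch> / ipfstat-s.<epoch> names above.
ts=$(date +%s)
if command -v ipfstat >/dev/null 2>&1; then
    ipfstat    > "ipfstat.$ts"
    ipfstat -s > "ipfstat-s.$ts"
fi
# Reproduce the failure, run the same two dumps again, then:
#   diff -u ipfstat-s.<before> ipfstat-s.<after>
echo "snapshot suffix: $ts"
```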
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Tue, Jun 24, 2014 at 11:01:01PM +1000, Darren Reed wrote:
> From that capture file, there is only *one* packet with the FIN bit
> set (line 20076):

Yes.

09:24:10.646480 IP (tos 0x0, ttl 64, id 6680, offset 0, flags [DF],
    proto TCP (6), length 1084)
    85.X.X.X.22 > 77.X.X.X.59412: Flags [FP.], cksum 0x744b (correct),
    seq 8611816:8612848, ack 9744, win 4197,
    options [nop,nop,TS val 26 ecr 26], length 1032

> which suggests that it isn't aligned with the normal pattern that you
> are seeing?

Well, when everything works fine, I see two FINs when the connection
closes, one from each end. Or did you mean something else?

> You might want to look more closely at the TCP session shutdown with
> wireshark to see what it thinks. Or a better idea would be to capture
> everything in/out of that NIC and concentrate on the part around where
> the session shuts down.

I just did a full tcpdump. Then I filtered out the particular ssh
session, all imap and submission traffic, and there was almost nothing
left except a few dns bits and two LLDP multicasts (?).
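To concentrate on the shutdown in such a full capture, the dump can be
re-read with a flag filter. A sketch; `full.pcap` is a placeholder for
whatever file the full tcpdump produced:

```shell
# Re-read the raw capture and keep only ICMP plus TCP segments that
# carry FIN or RST, i.e. the teardown traffic around the failure.
tcpdump -nr full.pcap 'icmp or (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'
```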
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Mon, Jun 23, 2014 at 10:16:21PM +0200, Manuel Bouyer wrote:
> So it could actually have closed the connection after that. Did your
> tcpdump sequences also capture ICMP traffic? Did an ICMP network
> unreachable packet show up?

I just created a new capture, this time with icmp:

tcpdump -nvvv icmp or \( tcp port ssh and not tcp src port 60227 \
    and not tcp dst port 60227 and not tcp src port 60178 \
    and not tcp dst port 60178 \)

60227 and 60178 were used by two interactive sessions. Unfortunately,
there is no ICMP in the capture except the first two packets, which
were sent by me in order to test the filter expression:

    http://smokva.net/pcap/crane-with-icmp.txt
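A quick follow-up check on such a capture is to ask specifically for
destination-unreachable messages. A sketch; the filename is a
placeholder for the capture above:

```shell
# Any ICMP destination-unreachable in the capture at all?
# icmp-unreach is pcap-filter's symbolic name for ICMP type 3.
tcpdump -nr crane-with-icmp.pcap 'icmp[icmptype] = icmp-unreach'
```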
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Mon, Jun 23, 2014 at 08:48:04PM +0200, Joerg Sonnenberger wrote:
> On Mon, Jun 23, 2014 at 07:19:41PM +0200, Petar Bogdanovic wrote:
> > Good question, I forgot to add that to my original message. Until
> > now, only scheduled ssh-tunnels were affected. Interactive
> > ssh-sessions seem fine, so do smtp, imap and http.
>
> Can you try to manually force the IPQoS option to lowdelay/throughput
> for both use cases?

Instead of:

/usr/bin/ssh -24fNn -L 22007:localhost:9102 -R 9101:localhost:9101 \
    -R 9103:localhost:9103 remote.host.de

I tried:

/usr/bin/ssh -24fNn -o IPQoS=lowdelay -L 22007:localhost:9102 (...)

and:

/usr/bin/ssh -24fNn -o IPQoS=throughput -L 22007:localhost:9102 (...)

but it failed again in both instances.
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On 23/06/2014 8:24 PM, Petar Bogdanovic wrote:
> During the past few weeks the ssh-tunnels to a remote machine started
> failing randomly. In a previous mail to tech-net I prematurely blamed
> ipfilter because disabling it yielded some immediate success.
> Unfortunately, subsequent testing showed that having npf enabled
> instead eventually led to the same issues.
>
> What I know:
>
>  * the server suddenly FINs the connection
>  * the server ignores everything after that and sends about 20-30 RSTs
>    for lots of late ACKs sent by the client
>  * ipmon is able to track the connection but misses the FIN
>  * yet ipfilter manages to update its state table and reduces the TTL
>    of the connection from 24h to 30s
>  * a server-tcpdump captures the FIN
>  * a client-tcpdump captures the same FIN
>  * according to wireshark, the FINs in both pcaps have sequence
>    numbers that indicate lost segments (which at least in one case
>    makes little sense since it was captured directly at the source)
>  * ssh and sshd both never try to tear down the connection
>  * ssh reports that the remote end has closed the connection
>  * sshd bails on a failed write() with ENETUNREACH

So the problem is this:

 * sshd tries to write to the socket, gets ENETUNREACH and then exits,
   leading to the FIN packets being transmitted as the socket is closed
   down in the normal course of things, but by the time it is doing the
   exit, the network path has been restored.

For ICMP packets to cause this, you would need to see many of them.

You've got public IP addresses in your capture file and you've made no
mention of using NAT, so I'm going to assume that the box with sshd/ssh
on it is connected to the Internet directly with some kind of cable
modem or similar. Are you able to cross-check the events from sshd with
log data from those devices?

For example, if the NIC facing outwards drops, then you will get
ENETUNREACH because the default route to the destination has
disappeared. Or if your DHCP-assigned IP address disappears briefly,
then again the route will disappear, and ENETUNREACH.

How about these two for me:

    netstat -s | grep -i unreach
    netstat -s | grep -i route

And of course the other important thing to do in an experiment is to
save the output of netstat -s at the start of a run and compare that
with its output when the problem has been seen again.

Kind Regards,
Darren
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On 24/06/2014 10:39 PM, Darren Reed wrote:
> On 23/06/2014 8:24 PM, Petar Bogdanovic wrote:
> > ...
> > * sshd bails on a failed write() with ENETUNREACH
>
> So the problem is this:
>
>  * sshd tries to write to the socket, gets ENETUNREACH and then exits,
>    leading to the FIN packets being transmitted as the socket is
>    closed down in the normal course of things, but by the time it is
>    doing the exit, the network path has been restored.
>
> For ICMP packets to cause this, you would need to see many of them.

Oh, I forgot, there are internal code paths in ipfilter/npf that can
return ENETUNREACH.

If you are using NetBSD 6 with ipfilter, comparing the output of this:

    ipfstat | grep 'block reason'

from before and after might be illuminating. Or maybe just compare the
entire output of ipfstat and ipfstat -s from before and after.

Kind Regards,
Darren
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Tue, Jun 24, 2014 at 10:44:12AM +0200, Daniel Hartmeier wrote:
> Have you tried with all packet filters disabled, just to make sure?

When this started, I actually did disable ipfilter. But back then the
failures were declining, so when I finally decided to go filterless,
the problem was no longer reproducible, with or without ipf (or npf).
This week this changed and the tunnels were reliably failing again..
so I did it again:

| 2014-06-24 08:00:06 |     0 |             0 | f |
| 2014-06-24 09:12:27 |     0 |             0 | f |
| 2014-06-24 09:14:18 |     0 |             0 | f |
| 2014-06-24 09:22:51 |     0 |             0 | f |
| 2014-06-24 09:24:00 |     0 |             0 | f |
| 2014-06-24 09:46:48 |     0 |             0 | f |
| 2014-06-24 09:48:43 |     0 |             0 | f |
| 2014-06-24 09:50:16 |     0 |             0 | f |
| 2014-06-24 09:59:52 |     0 |             0 | f |
| 2014-06-24 10:05:08 |     0 |             0 | f |
| 2014-06-24 10:06:13 |     0 |             0 | f |
| 2014-06-24 10:14:20 |     0 |             0 | f |
| 2014-06-24 10:17:15 |     0 |             0 | f |
| 2014-06-24 10:22:56 |     0 |             0 | f |
--- disabled ipf here
| 2014-06-24 10:49:24 | 7,830 | 3,005,086,948 | T |
--- successfully transferred 3GB
| 2014-06-24 10:22:56 |     0 |             0 | f |
--- enabled ipf here
| 2014-06-24 12:12:45 |     0 |             0 | f |

So the local filter actually does play some role here.. :)

> On Mon, Jun 23, 2014 at 12:24:08PM +0200, Petar Bogdanovic wrote:
> > * sshd bails on a failed write() with ENETUNREACH
>
> This can be because IPF (and NPF) return this errno to local sockets
> when blocking a packet. Probably the blocked packet is part of the
> port-forwarded connection, and sshd is reacting to the write error by
> closing the control connection.
>
> How do you filter your loopback interface? Do you have specific
> stateless pass rules for lo0? I've seen cases where, without explicit
> lo0 rules, the packet filter tried to track state on loopback, and saw
> packets only in one direction, leading to stalling sessions when
> window scaling was enabled and traffic was flowing just fast enough to
> fill windows. This could work fine for weeks, just to pop up after
> something unrelated was changed (adding a database index to increase
> query response speed).

I've been using stateless pass rules for lo0 in both cases, ipf and
npf. My ipf configuration is older and might contain some ambiguities,
especially with regard to the lack of lo0 rules in ipf6.conf:

# ipf.conf
pass in quick on lo0 all
pass out quick on lo0 all

block in quick all with mbcast
block in all
block out all

pass in proto icmp icmp-type 3
pass out proto icmp icmp-type 3
pass in proto icmp icmp-type 8 keep state
pass out proto icmp icmp-type 8 keep state

block return-rst in proto tcp
block return-icmp in proto udp
(...)

# ipf6.conf
block in quick all
block out quick all

The npf configuration is simpler and makes no distinction between ip4
and ip6:

# npf.conf
$tcp_in = { domain, ... }
$udp_in = { domain }
$tcp_out = { domain, ... }
$udp_out = { domain, ... }

procedure log {
        # ifconfig npflog0 create
        log: npflog0
}

group (default) {
        pass final on lo0 all

        pass stateful in final proto icmp icmp-type 8 all
        pass stateful out final proto icmp icmp-type 8 all

        pass stateful in final proto tcp to any port $tcp_in
        pass stateful in final proto udp to any port $udp_in
        pass stateful out final proto tcp to any port $tcp_out
        pass stateful out final proto udp to any port $udp_out

        block return all

        pass out proto icmp icmp-type 3 all
        pass out proto tcp flags R/R all
}

Otherwise, your interpretation is very intriguing given the fact that
the only connections failing are the ssh connections that do
port-forwarding. Worth mentioning: if transfers are short, they
sometimes succeed even if ipf is on.
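Given the lo0 gap noted above, a hedged sketch of what ipf6.conf could
look like with the same stateless loopback passes that ipf.conf already
has (untested; it mirrors only the IPv4 loopback rules):

```
# ipf6.conf (sketch): pass loopback statelessly before blocking,
# mirroring the lo0 rules in ipf.conf
pass in quick on lo0 all
pass out quick on lo0 all
block in quick all
block out quick all
```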
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Mon, Jun 23, 2014 at 12:24:08PM +0200, Petar Bogdanovic wrote:
> Unfortunately, subsequent testing showed that having npf enabled
> instead eventually led to the same issues.

Clarification: both ipf and npf were/are configured and enabled at the
server, and not somewhere in between (or at the client).
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On 06/23/2014 09:25 AM, Petar Bogdanovic wrote:
> But the thing is.. the server apparently sends a FIN.. hence the
> server closes the connection, nothing in between. And it does so out
> of the blue, without any messages or errors.

It's still hard to say that there is nothing interfering with the
network traffic. I've seen newer firewalls gracefully terminate (not
just drop) explicit FTP connections when the client negotiates without
SSL/TLS. If not a firewall, I'm still betting on some sort of network
change.

Brandon Vincent
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
Petar Bogdanovic <pe...@smokva.net> wrote:
> During the past few weeks the ssh-tunnels to a remote machine started
> failing randomly. In a previous mail to tech-net I prematurely blamed
> ipfilter because disabling it yielded some immediate success.
> Unfortunately, subsequent testing showed that having npf enabled
> instead eventually led to the same issues.
> ...
> P.S. pcaps of one failed tunnel are here:
> http://smokva.net/pcap/crane.tgz

Are you using 6.x or -current? If the latter, is it the latest
-current? You can feed the pcap file to npftest [1] using your config
and see what is going on with NPF state tracking.

[1] http://nxr.netbsd.org/xref/src/usr.sbin/npf/npftest/README

-- 
Mindaugas
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Mon, Jun 23, 2014 at 09:50:59AM -0700, Brandon Vincent wrote:
> On 06/23/2014 09:25 AM, Petar Bogdanovic wrote:
> > But the thing is.. the server apparently sends a FIN.. hence the
> > server closes the connection, nothing in between. And it does so out
> > of the blue, without any messages or errors.
>
> It's still hard to say that there is nothing interfering with the
> network traffic. I've seen newer firewalls gracefully terminate (not
> just drop) explicit FTP connections when the client negotiates without
> SSL/TLS. If not a firewall, I'm still betting on some sort of network
> change.

Due to the alternating nature of the issue, I would bet the same. I
just can't wrap my head around the fact that the server actually closes
the connection. If some other participant triggers the FIN, I'd expect
some evidence inside the pcaps.. but apart from the client being late
with ACKs, the connection shows nothing abnormal to the naive eye.
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Mon, Jun 23, 2014 at 05:53:10PM +0100, Mindaugas Rasiukevicius wrote:
> Are you using 6.x or -current? If the latter, is it the latest
> -current?

The setup is conservative, follows netbsd-6-0.
Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)
On Mon, Jun 23, 2014 at 12:24:08PM +0200, Petar Bogdanovic wrote:
> During the past few weeks the ssh-tunnels to a remote machine started
> failing randomly. In a previous mail to tech-net I prematurely blamed
> ipfilter because disabling it yielded some immediate success.
> Unfortunately, subsequent testing showed that having npf enabled
> instead eventually led to the same issues.
>
> What I know:
>
>  * the server suddenly FINs the connection
>  * the server ignores everything after that and sends about 20-30 RSTs
>    for lots of late ACKs sent by the client
>  * ipmon is able to track the connection but misses the FIN
>  * yet ipfilter manages to update its state table and reduces the TTL
>    of the connection from 24h to 30s
>  * a server-tcpdump captures the FIN
>  * a client-tcpdump captures the same FIN
>  * according to wireshark, the FINs in both pcaps have sequence
>    numbers that indicate lost segments (which at least in one case
>    makes little sense since it was captured directly at the source)
>  * ssh and sshd both never try to tear down the connection
>  * ssh reports that the remote end has closed the connection
>  * sshd bails on a failed write() with ENETUNREACH

So it could actually have closed the connection after that. Did your
tcpdump sequences also capture ICMP traffic? Did an ICMP network
unreachable packet show up?

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference