Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-25 Thread Petar Bogdanovic
On Tue, Jun 24, 2014 at 10:39:46PM +1000, Darren Reed wrote:
 
 So the problem is this:
 * sshd tries to write to the socket, gets ENETUNREACH
 
 and then exits, leading to the FIN packets being transmitted as the
 socket is closed down in the normal course of things, but by the time
 it exits, the network path has been restored.

Right.   Here is the (slightly redacted) output of ktruss:

http://smokva.net/pcap/crane-trace-sshd.txt


 You've got public IP addresses in your capture file and you've made no
 mention of using NAT, so I'm going to assume that the boxes with sshd/ssh
 on them are connected to the Internet directly with some kind of cable
 modem or similar.

The client connects through a FreeBSD router where an older
pf-version is doing filtering, nat and altq (priq).

The server has a public IP.


 Are you able to cross check the events from sshd with log data from those
 devices?

The most cross checking I've done is comparing the two pcaps.  There is
not much to cross check otherwise, since nothing really complains about
anything..

Also the ssh tunnels seem to work fine with ipf disabled.  I'm running a
third 3GB job today with ipf disabled.  Without disabling ipf the job
wouldn't survive longer than 10-30 seconds.


 How about these two for me:
 netstat -s | grep -i unreach
 netstat -s | grep -i route

Sure:

# netstat -s | head -n20
icmp:
413 calls to icmp_error
0 errors not generated because old message was icmp
Output histogram:
echoreply: 100
unreach: 413
photuris: 6
4 messages with bad code fields
0 messages < minimum length
0 bad checksums
0 messages with bad length
0 multicast echo requests ignored
0 multicast timestamp requests ignored
Input histogram:
echoreply: 6
unreach: 731
echo: 100
timxceed: 102
100 message responses generated
0 path MTU changes

# netstat -s | grep -i unreach
unreach: 413
unreach: 731
0 dropped due to ICMP unreachable
0 address unreachable
0 port unreachable
0 dropped due to ICMP unreachable
0 packets rcvd for unreachable dest 

# netstat -s | grep -i route
0 SYNs dropped (no route or no space)
0 output packets discarded due to no route
0 no route available (output)
17313 output packets discarded due to no route
0 no route
0 bad router solicitation messages
0 bad router advertisement messages
0 router advertisement routes dropped
0 SYNs dropped (no route or no space)
0 no route available (output)


 And of course the other important thing to do in an experiment is to save
 the output of netstat -s at the start of a run and compare that with its
 output when the problem has been seen again.

Here is the diff of two files I created before and after a failed job:

# ls -la netstat.*
-rw-r--r--  1 root  wheel  16332 Jun 25 09:58 netstat.a
-rw-r--r--  1 root  wheel  16332 Jun 25 09:59 netstat.b

# diff -u netstat.a  netstat.b 
--- netstat.a   2014-06-25 09:58:29.0 +0200
+++ netstat.b   2014-06-25 09:59:16.0 +0200
@@ -29,61 +29,61 @@
0 membership reports received for groups to which we belong
0 membership reports sent
 tcp:
-   36143366 packets sent
-   6650618 data packets (13929288006 bytes)
+   36178462 packets sent
+   6684489 data packets (14022704288 bytes)
4682 data packets (4775110 bytes) retransmitted
-   17455512 ack-only packets (27272960 delayed)
+   17456129 ack-only packets (27274542 delayed)
0 URG only packets
0 window probe packets
-   12029336 window update packets
-   8939 control packets
+   12029943 window update packets
+   8947 control packets
0 send attempts resulted in self-quench
-   48058930 packets received
-   3572420 acks (for 13901480535 bytes)
-   50912 duplicate acks
+   48083472 packets received
+   3589267 acks (for 13994738366 bytes)
+   51243 duplicate acks
5 acks for unsent data
-   42343900 packets (64839052960 bytes) received in-sequence
-   13840 completely duplicate packets (852212 bytes)
+   42346063 packets (64885718854 bytes) received in-sequence
+   13866 completely duplicate packets (852212 bytes)
1651 old duplicate packets
781 packets with some dup. data (277997 bytes duped)
-   955822 out-of-order packets (1113010976 bytes)
+   955823 out-of-order packets (1113010976 bytes)
0 packets (0 bytes) of data after window
0 

Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-25 Thread Petar Bogdanovic
On Tue, Jun 24, 2014 at 11:39:47PM +1000, Darren Reed wrote:
 
 Oh, I forgot, there are internal code paths in ipfilter/npf that can
 return ENETUNREACH.
 
 If you are using NetBSD 6 with ipfilter, comparing the output of this:
 
 ipfstat | grep 'block reason'
 
 from before and after might be illuminating.
 
 Or maybe just compare the entire output of ipfstat and ipfstat -s
 from before and after.

No problem, ipfstat before and after failed tunnel (reproducing it now
is very easy):

# ls -la
total 12
drwxrwxrwt   2 root  wheel  512 Jun 25 10:10 .
drwxr-xr-x  19 root  wheel  512 Jun 20 20:51 ..
-rw-r--r--   1 root  wheel  535 Jun 25 10:09 ipfstat-s.1403683750
-rw-r--r--   1 root  wheel  535 Jun 25 10:10 ipfstat-s.1403683819
-rw-r--r--   1 root  wheel  805 Jun 25 10:09 ipfstat.1403683750
-rw-r--r--   1 root  wheel  806 Jun 25 10:10 ipfstat.1403683819


# diff -u ipfstat-s.1403683750 ipfstat-s.1403683819
--- ipfstat-s.1403683750    2014-06-25 10:09:10.0 +0200
+++ ipfstat-s.1403683819    2014-06-25 10:10:19.0 +0200
@@ -1,27 +1,27 @@
 IP states added:
-   17761 TCP
-   58310 UDP
+   17772 TCP
+   58329 UDP
92 ICMP
-   76918765 hits
-   439924 misses
+   77026414 hits
+   450385 misses
0 bucket full
0 maximum rule references
0 maximum
0 no memory
-   14 bkts in use
-   14 active
-   58402 expired
-   17747 closed
+   18 bkts in use
+   18 active
+   58418 expired
+   17757 closed
 State logging enabled
 
 State table bucket statistics:
-   14 in use   
+   18 in use   
100% hash efficiency
-   0.24% bucket usage
+   0.31% bucket usage
0 minimal length
1 maximal length
1.000 average length
 
 TCP Entries per state
 0 1 2 3 4 5 6 7 8 9 10 11
- 0 0 0 0 3 0 0 0 0 0 8 3
+ 0 0 0 0 3 0 0 0 0 0 8 4


# diff -u ipfstat.1403683750 ipfstat.1403683819
--- ipfstat.1403683750  2014-06-25 10:09:10.0 +0200
+++ ipfstat.1403683819  2014-06-25 10:10:19.0 +0200
@@ -1,22 +1,22 @@
 bad packets:   in 0out 0
- IPv6 packets: in 0 out 5153
- input packets:blocked 53 passed 44336 nomatch 0 counted 0 short 0
-output packets:blocked 5218 passed 60118 nomatch 0 counted 0 short 0
+ IPv6 packets: in 0 out 5155
+ input packets:blocked 53 passed 92750 nomatch 0 counted 0 short 0
+output packets:blocked 5239 passed 129793 nomatch 0 counted 0 short 0
  input packets logged: blocked 0 passed 0
-output packets logged: blocked 65 passed 0
+output packets logged: blocked 84 passed 0
  packets logged:   input 0 output 0
- log failures: input 0 output 13
+ log failures: input 0 output 32
 fragment state(in):kept 0  lost 0  not fragmented 0
 fragment state(out):   kept 0  lost 0  not fragmented 0
-packet state(in):  kept 1601   lost 0
+packet state(in):  kept 1631   lost 0
 packet state(out): kept 27 lost 0
 ICMP replies:  2   TCP RSTs sent:  50
 Invalid source(in):0
-Result cache hits(in): 1935(out):  6778
+Result cache hits(in): 4644(out):  9491
 IN Pullups succeeded:  0   failed: 0
 OUT Pullups succeeded: 0   failed: 0
 Fastroute successes:   3538failures:   0
 TCP cksum fails(in):   0   (out):  0
-IPF Ticks: 781113
+IPF Ticks: 781251
 Packet log flags set: (0)
none



Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-25 Thread Petar Bogdanovic
On Tue, Jun 24, 2014 at 11:01:01PM +1000, Darren Reed wrote:
 
 From that capture file, there is only *one* packet with the FIN bit set
 (line 20076):

Yes.


 09:24:10.646480 IP (tos 0x0, ttl 64, id 6680, offset 0, flags [DF],
 proto TCP (6), length 1084)
 85.X.X.X.22 > 77.X.X.X.59412: Flags [FP.], cksum 0x744b (correct),
 seq 8611816:8612848, ack 9744, win 4197, options [nop,nop,TS val 26 ecr
 26], length 1032
 
 which suggests that it isn't aligned with the normal pattern that you
 are seeing?

Well, when everything works fine, I see two FINs when the connection
closes, one from each end.  Or did you mean something else?


 You might want to look more closely at the TCP session shutdown with
 wireshark to see what it thinks. Or a better idea would be to capture
 everything in/out of that NIC and concentrate on the part around where
 the session shuts down.

I just did a full tcpdump.  Then I filtered out the particular ssh
session, all imap and submission traffic and there was almost nothing
left except a few dns bits and two LLDP_Multicasts (?).


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-24 Thread Petar Bogdanovic
On Mon, Jun 23, 2014 at 10:16:21PM +0200, Manuel Bouyer wrote:
 
 So it could actually have closed the connection after that.
 Did your tcpdump sequences also capture ICMP traffic?
 Did an ICMP network-unreachable packet show up?

I just created a new capture, this time with icmp:

tcpdump -nvvv icmp or \( tcp port ssh \
    and not tcp src port 60227 \
    and not tcp dst port 60227 \
    and not tcp src port 60178 \
    and not tcp dst port 60178 \)

60227 and 60178 were used by two interactive sessions.

Unfortunately, there is no ICMP in the capture except the first two
packets which were sent by me in order to test the filter expression:

http://smokva.net/pcap/crane-with-icmp.txt


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-24 Thread Petar Bogdanovic
On Mon, Jun 23, 2014 at 08:48:04PM +0200, Joerg Sonnenberger wrote:
 On Mon, Jun 23, 2014 at 07:19:41PM +0200, Petar Bogdanovic wrote:
  Good question, I forgot to add that to my original message.  Until now,
  only scheduled ssh-tunnels were affected.  interactive ssh-sessions seem
  fine, so does smtp, imap and http.
 
 Can you try to manually force the IPQoS option to lowdelay/throughput
 for both use cases?

Instead of:

/usr/bin/ssh -24fNn -L 22007:localhost:9102 -R 9101:localhost:9101 \
    -R 9103:localhost:9103 remote.host.de

I tried:

/usr/bin/ssh -24fNn -o IPQoS=lowdelay -L 22007:localhost:9102 (...)

and:

/usr/bin/ssh -24fNn -o IPQoS=throughput -L 22007:localhost:9102 (...)

but it failed again in both instances.


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-24 Thread Darren Reed
On 23/06/2014 8:24 PM, Petar Bogdanovic wrote:
 During the past few weeks the ssh-tunnels to a remote machine started
 failing randomly.  In a previous mail to tech-net I prematurely blamed
 ipfilter because disabling it yielded some immediate success.

 Unfortunately, subsequent testing showed that having npf enabled instead
 eventually led to the same issues.

 What I know:

   * the server suddenly FINs the connection
   * the server ignores everything after that and sends about 20-30
 RSTs for lots of late ACKs sent by the client
   * ipmon is able to track the connection but misses the FIN
   * yet ipfilter manages to update its state table and reduces the
 TTL of the connection from 24h to 30s
   * a server-tcpdump captures the FIN
   * a client-tcpdump captures the same FIN
   * according to wireshark, the FINs in both pcaps have sequence
 numbers that indicate lost segments (which at least in one
 case makes little sense since it was captured directly at the
 source)
   * ssh and sshd both never try to tear down the connection
   * ssh reports that the remote end has closed the connection
   * sshd bails on a failed write() with ENETUNREACH

So the problem is this:
* sshd tries to write to the socket, gets ENETUNREACH

and then exits, leading to the FIN packets being transmitted as the socket
is closed down in the normal course of things, but by the time it exits,
the network path has been restored.

For ICMP packets to cause this, you would need to see many of them.

You've got public IP addresses in your capture file and you've made no
mention of using NAT, so I'm going to assume that the boxes with sshd/ssh
on them are connected to the Internet directly with some kind of cable
modem or similar.

Are you able to cross check the events from sshd with log data from those
devices?

For example, if the outward-facing NIC goes down, you will get ENETUNREACH
because the default route to the destination has disappeared. Or if your
DHCP-assigned IP address disappears briefly, then again the route will
disappear and write() will fail with ENETUNREACH.
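One way to catch such a brief route flap is to log the default route every
second and line the log up with sshd's error timestamps. A minimal sketch
only: ROUTE_CMD is a stand-in (the address 192.0.2.1 and interface bge0
are made up), and the loop is capped at three iterations so the example
is self-contained:

```shell
#!/bin/sh
# Sketch: log the default route once a second so a short route flap can
# be matched against sshd's ENETUNREACH timestamps.
# ROUTE_CMD is a stand-in (192.0.2.1/bge0 are made up); for real use:
#   ROUTE_CMD="netstat -rn"
ROUTE_CMD=${ROUTE_CMD:-"echo default 192.0.2.1 UGS bge0"}

i=0
while [ "$i" -lt 3 ]; do        # for real: while :; do ...; sleep 1; done
    printf '%s %s\n' "$(date +%H:%M:%S)" "$($ROUTE_CMD | grep default)" \
        >> route.log
    i=$((i + 1))
done
wc -l < route.log               # three timestamped route entries
```

If the default route ever vanishes during the run, the corresponding log
line comes out empty, which pins down the flap to the second.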

How about these two for me:
netstat -s | grep -i unreach
netstat -s | grep -i route

And of course the other important thing to do in an experiment is to save
the output of netstat -s at the start of a run and compare that with its
output when the problem has been seen again.
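That before/after comparison can be scripted; here is a minimal sketch,
where STAT is a stand-in function for `netstat -s` (the counter values
are made up) so the example is self-contained:

```shell
#!/bin/sh
# Sketch: snapshot the counters at the start of a run, snapshot again
# after the tunnel has failed, then diff.  STAT is a stand-in for
# 'netstat -s'; the counter values are made up.
STAT() { echo "packets sent: $1"; }

STAT 100 > netstat.a            # for real: netstat -s > netstat.a
# ... run the job until the tunnel fails ...
STAT 142 > netstat.b            # for real: netstat -s > netstat.b

# Show only the counters that moved:
diff -u netstat.a netstat.b | grep '^[+-]packets'
```

With the stand-in, only the one counter that moved survives the grep,
which is exactly the signal to look for in the real netstat output.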

Kind Regards,
Darren



Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-24 Thread Darren Reed
On 24/06/2014 10:39 PM, Darren Reed wrote:
 On 23/06/2014 8:24 PM, Petar Bogdanovic wrote:
 ...  * sshd bails on a failed write() with ENETUNREACH
 So the problem is this:
 * sshd tries to write to the socket, gets ENETUNREACH

 and then exits, leading to the FIN packets being transmitted as the
 socket is closed down in the normal course of things, but by the time
 it exits, the network path has been restored.

 For ICMP packets to cause this, you would need to see many of them.


Oh, I forgot, there are internal code paths in ipfilter/npf that can
return ENETUNREACH.

If you are using NetBSD 6 with ipfilter, comparing the output of this:

ipfstat | grep 'block reason'

from before and after might be illuminating.

Or maybe just compare the entire output of ipfstat and ipfstat -s
from before and after.

Kind Regards,
Darren



Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-24 Thread Petar Bogdanovic
On Tue, Jun 24, 2014 at 10:44:12AM +0200, Daniel Hartmeier wrote:
 
 Have you tried with all packet filters disabled, just to make sure?

When this started, I actually did disable ipfilter.  But back then the
failures were declining so when I finally decided to go filterless, the
problem was no longer reproducible---with or without ipf (or npf).

This week this changed and the tunnels were reliably failing again.. so
I did it again:

   | 2014-06-24 08:00:06 | 0 |  0 | f  |
   | 2014-06-24 09:12:27 | 0 |  0 | f  |
   | 2014-06-24 09:14:18 | 0 |  0 | f  |
   | 2014-06-24 09:22:51 | 0 |  0 | f  |
   | 2014-06-24 09:24:00 | 0 |  0 | f  |
   | 2014-06-24 09:46:48 | 0 |  0 | f  |
   | 2014-06-24 09:48:43 | 0 |  0 | f  |
   | 2014-06-24 09:50:16 | 0 |  0 | f  |
   | 2014-06-24 09:59:52 | 0 |  0 | f  |
   | 2014-06-24 10:05:08 | 0 |  0 | f  |
   | 2014-06-24 10:06:13 | 0 |  0 | f  |
   | 2014-06-24 10:14:20 | 0 |  0 | f  |
   | 2014-06-24 10:17:15 | 0 |  0 | f  |
   | 2014-06-24 10:22:56 | 0 |  0 | f  | --- disabled ipf here
   | 2014-06-24 10:49:24 | 7,830 |  3,005,086,948 | T  | --- successfully transferred 3GB
   | 2014-06-24 10:22:56 | 0 |  0 | f  | --- enabled ipf here
   | 2014-06-24 12:12:45 | 0 |  0 | f  |

So the local filter actually does play some role here.. :)


 On Mon, Jun 23, 2014 at 12:24:08PM +0200, Petar Bogdanovic wrote:
 
  * sshd bails on a failed write() with ENETUNREACH
 
 This can be because IPF (and NPF) return this errno to local sockets
 when blocking a packet.
 
 Probably the blocked packet is part of the port forwarded connection,
 and sshd is reacting to the write error by closing the control
 connection.
 
 How do you filter your loopback interface?
 
 Do you have specific stateless pass rules for lo0?
 
 I've seen cases where, without explicit lo0 rules, the packet filter
 tried to track state on loopback, and saw packets only in one direction,
 leading to stalling sessions when window scaling was enabled and
 traffic was just flowing fast enough to fill windows. This could work
 fine for weeks, just to pop up after something unrelated was changed
 (adding a database index to increase query response speed).

I've been using stateless pass rules for lo0 in both cases, ipf and npf.

My ipf configuration is older and might contain some ambiguities,
especially with regard to the lack of lo0 rules in ipf6.conf:

# ipf.conf
pass in  quick on lo0 all
pass out quick on lo0 all

block in quick all with mbcast

block in  all
block out all

pass in  proto icmp icmp-type 3
pass out proto icmp icmp-type 3

pass in  proto icmp icmp-type 8 keep state
pass out proto icmp icmp-type 8 keep state

block return-rst  in proto tcp
block return-icmp in proto udp

(...)


# ipf6.conf
block in  quick all
block out quick all
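
Given that lack, one low-risk experiment would be adding explicit
stateless lo0 rules to ipf6.conf as well, mirroring the ipf.conf above.
A sketch only, untested:

```
# ipf6.conf (sketch: explicit stateless lo0 rules, as in ipf.conf)
pass in  quick on lo0 all
pass out quick on lo0 all

block in  quick all
block out quick all
```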


The npf configuration is simpler and makes no distinction between ip4
and ip6:

# npf.conf
$tcp_in  = { domain, ... }
$udp_in  = { domain }

$tcp_out = { domain, ... }
$udp_out = { domain, ... }

procedure log {
   # ifconfig npflog0 create
   log: npflog0
}

group (default) {
   pass final on lo0 all

   pass stateful in  final proto icmp icmp-type 8 all
   pass stateful out final proto icmp icmp-type 8 all

   pass stateful in  final proto tcp to any port $tcp_in
   pass stateful in  final proto udp to any port $udp_in
   pass stateful out final proto tcp to any port $tcp_out
   pass stateful out final proto udp to any port $udp_out

   block return all
   pass out proto icmp icmp-type 3 all
   pass out proto tcp flags R/R all
}


Otherwise your interpretation is very intriguing given the fact that the
only connections failing are the ssh connections that do port-forwarding.

Worth mentioning: If transfers are short, they sometimes succeed even if
ipf is on.


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-23 Thread Petar Bogdanovic
On Mon, Jun 23, 2014 at 12:24:08PM +0200, Petar Bogdanovic wrote:
 
 Unfortunately, subsequent testing showed that having npf enabled instead
 eventually led to the same issues.

Clarification:  both ipf and npf were/are configured and enabled at the
server and not somewhere in between (or at the client).


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-23 Thread Brandon Vincent
On 06/23/2014 09:25 AM, Petar Bogdanovic wrote:
 But the thing is.. the server apparently
 sends a FIN.. hence the server closes the
 connection, nothing in between. And it
 does so out of the blue without any
 messages or errors.

It's still hard to say that there is nothing interfering with the
network traffic. I've seen newer firewalls gracefully terminate (not
just drop) explicit FTP connections when the client negotiates without
SSL/TLS.

If not a firewall, I'm still betting on some sort of network change.

Brandon Vincent


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-23 Thread Mindaugas Rasiukevicius
Petar Bogdanovic pe...@smokva.net wrote:
 During the past few weeks the ssh-tunnels to a remote machine started
 failing randomly.  In a previous mail to tech-net I prematurely blamed
 ipfilter because disabling it yielded some immediate success.
 
 Unfortunately, subsequent testing showed that having npf enabled instead
 eventually led to the same issues.
 
 ...
 
 P.S. pcaps of one failed tunnel are here:
   http://smokva.net/pcap/crane.tgz

Are you using 6.x or -current?  If the latter, is it the latest -current?
You can feed the pcap file to npftest [1] using your config and see what
is going on with NPF state tracking.

[1] http://nxr.netbsd.org/xref/src/usr.sbin/npf/npftest/README

-- 
Mindaugas


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-23 Thread Petar Bogdanovic
On Mon, Jun 23, 2014 at 09:50:59AM -0700, Brandon Vincent wrote:
 On 06/23/2014 09:25 AM, Petar Bogdanovic wrote:
  But the thing is.. the server apparently
  sends a FIN.. hence the server closes the
  connection, nothing in between. And it
  does so out of the blue without any
  messages or errors.
 
 It's still hard to say that there is nothing interfering with the
 network traffic. I've seen newer firewalls gracefully terminate (not
 just drop) explicit FTP connections when the client negotiates without
 SSL/TLS.
 
 If not a firewall, I'm still betting on some sort of network change.

Due to the alternating nature of the issue, I would bet the same.  I
just can't wrap my head around the fact that the server actually closes
the connection.  If some other participant triggers the FIN, I'd expect
some evidence inside the pcaps.. but apart from the client being late
with ACKs, the connection shows nothing abnormal to the naive eye.


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-23 Thread Petar Bogdanovic
On Mon, Jun 23, 2014 at 05:53:10PM +0100, Mindaugas Rasiukevicius wrote:
 
 Are you using 6.x or -current?  If the latter, is it the latest -current?

The setup is conservative, follows netbsd-6-0.


Re: something is randomly closing ssh-tunnels (was: ipfilter randomly dropping..)

2014-06-23 Thread Manuel Bouyer
On Mon, Jun 23, 2014 at 12:24:08PM +0200, Petar Bogdanovic wrote:
 During the past few weeks the ssh-tunnels to a remote machine started
 failing randomly.  In a previous mail to tech-net I prematurely blamed
 ipfilter because disabling it yielded some immediate success.
 
 Unfortunately, subsequent testing showed that having npf enabled instead
 eventually lead to the same issues.
 
 What I know:
 
   * the server suddenly FINs the connection
   * the server ignores everything after that and sends about 20-30
 RSTs for lots of late ACKs sent by the client
   * ipmon is able to track the connection but misses the FIN
   * yet ipfilter manages to update its state table and reduces the
 TTL of the connection from 24h to 30s
   * a server-tcpdump captures the FIN
   * a client-tcpdump captures the same FIN
   * according to wireshark, the FINs in both pcaps have sequence
 numbers that indicate lost segments (which at least in one
 case makes little sense since it was captured directly at the
 source)
   * ssh and sshd both never try to tear down the connection
   * ssh reports that the remote end has closed the connection
   * sshd bails on a failed write() with ENETUNREACH

So it could actually have closed the connection after that.
Did your tcpdump sequences also capture ICMP traffic?
Did an ICMP network-unreachable packet show up?

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 years of experience will always make the difference
--