Adding some comments here. I'm playing catch-up, so I may have comments
on some things that have been fixed, and missed others.
On 16/08/2018, 08:28, Mikael Abrahamsson wrote:
On Wed, 15 Aug 2018, Kent Watsen wrote:
You bring up an interesting point, it goes to the motivation for
wanting to do keepalives in the first place. The text doesn't yet
mention maintaining flow state as a motivation.
It's not only to maintain flow state, it's also to close the
connection when the network goes down and doesn't work anymore, and
"give up" on connections that don't work anymore (for some
definition of "anymore").
I have operationally been in the situation where a server/client
application was implemented so that the server could only handle 256
connections (some file descriptor limit). Every time the firewall was
rebooted and lost state, the connections hung around forever. So the
server administrators had to go in and restart the process to clear
these connections, otherwise there were 256 hung connections and no
new connections could be established.
Sometimes the other endpoint goes down, and doesn't come back. We will
for instance deploy home gateways probably keeping netconf-call-home
sessions to an NMS, and we want them to be around forever, as long as
they work. TCP level keepalives would solve this, as if the customer
just powers off the device, after a while the session will be cleared.
Using TCP keepalives here means you get this kind of behaviour even if
the upper-layer application doesn't support it (netconf might have
been a bad example here). It's a single socket option to set, so it's
very easy to do.
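
As a minimal sketch of the "single socket option" mentioned above, this is
roughly what enabling TCP keepalives looks like in Python. The helper name
enable_keepalive is made up for illustration; only SO_KEEPALIVE itself is the
standard option. With nothing else set, the probe timers stay at the operating
system defaults, which are often very long.

```python
import socket


def enable_keepalive(sock: socket.socket) -> None:
    """Turn on TCP keepalive probes for this socket (hypothetical helper)."""
    # SO_KEEPALIVE is the single socket-level switch; timers stay at OS defaults.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


# The option can be set before connecting, so no peer is needed to try it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# Some platforms report a non-zero flag value other than 1, so compare with 0.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Because this is done on the socket, the application protocol running over it
does not need any keepalive support of its own.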
Agree. I think if we look to the transport layer that allowing a flow to
continue to use existing "network" state (in various forms) is an
important aspect - there are NATs, Firewalls, QoS Classifiers, etc as
well as load balancers, and layer 2/3's that take resource decisions at
the flow level. Normally all of these do the correct thing when there is
a continuous flow of packets.
Somewhere in the thread I also saw a statement that suggested that
associations should be short-lived. If that advice is carried to the
transport layer, I would expect it to have a serious impact on the
performance for some paths! (There are important trade-offs here, and we
should not make sweeping assumptions).
From knowing approximately what settings people have in their NAT44 and
firewalls etc, I'd say the recommendation should be that keepalives
are set to a 60-300 second interval, and the connection is killed
if no traffic has passed in 3-5 of these intervals. Otherwise TCP will
have backed off so far anyway that
it's probably faster to just re-try the connection instead of waiting
for TCP to re-send the packet.
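
The timer suggestion above could be sketched as follows. This is only an
illustration under stated assumptions: the values (60 seconds, 4 missed probes)
are picked from within the suggested ranges, the function names are
hypothetical, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are per-socket
options that exist on Linux but not on every platform, so the code skips any
that the local socket module does not expose.

```python
import socket

# Assumed values, chosen from the 60-300 s and 3-5 interval ranges above.
KEEPALIVE_INTERVAL_S = 60
MAX_MISSED_PROBES = 4


def configure_keepalive(sock, interval_s=KEEPALIVE_INTERVAL_S,
                        max_missed=MAX_MISSED_PROBES):
    """Enable keepalives and set the timers where the platform allows it.

    Returns a dict of the TCP-level options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    wanted = (
        ("TCP_KEEPIDLE", interval_s),   # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # gap between unanswered probes
        ("TCP_KEEPCNT", max_missed),    # unanswered probes before giving up
    )
    applied = {}
    for name, value in wanted:
        opt = getattr(socket, name, None)  # None where the option is missing
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied


def worst_case_detection_s(interval_s, max_missed):
    """Rough time to declare a silent peer dead: idle time plus failed probes."""
    return interval_s + interval_s * max_missed
```

With these assumed values a dead peer would be detected after roughly
60 + 4 * 60 = 300 seconds, which is the kind of bound argued for above.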
I have seen so many times in my 20 years working in networking where
lack of keepalives has caused all kinds of problems. I wish everybody
would turn it on and keep it on.
I agree. I have the feeling that this is not at all easy advice to get
correct in a general way (and this thread is not quite there yet). E.g., RFC
5245 set lower limits for timers - because that was thought important.
I don't agree that protocol stacks with a secure transport protocol
layer (e.g., TLS, SSH, DTLS) that sits on top of a cleartext protocol
layer (e.g., TCP, UDP) should be advised to do the aliveness check only
within the protection envelope afforded by the secure transport protocol
layer - to me that seems entirely wrong - it has the same "issue" as
above, it depends on the function of the aliveness check and the way
this is used by the layer's protocol machine. In many cases it is
absolutely desirable to do this within the layer that needs this
information. Passing the detailed state down between layers can be most
awkward. Higher layers can make their own decisions - and suppress
keep-alives or reaffirm state.
Guidance from the transport perspective on timers is in RFC8085, Section
3.1.1; there is also more advice in the "behave" RFCs and a summary of the
mechanisms in RFC8085 3.5 (noted by Lars) .... The vulnerabilities are
also noted in RFC8085, and I think we should be clear to differentiate
between on-path versus off-path knowledge when understanding this.
Gorry