Adding some comments here. I'm playing catch-up, so I may have commented on some things that have already been fixed, and missed others.

On 16/08/2018, 08:28, Mikael Abrahamsson wrote:
On Wed, 15 Aug 2018, Kent Watsen wrote:

You bring up an interesting point; it goes to the motivation for wanting to do keepalives in the first place. The text doesn't yet mention maintaining flow state as a motivation.

It's not only to maintain flow state, it's also to close the connection when the network goes down, and to "give up" on connections that don't work anymore (for some definition of "anymore").

I have operationally been in the situation where a server/client application was implemented so that the server could only handle 256 connections (some file descriptor limit). Every time the firewall was rebooted and lost state, the connections hung around forever. So the server administrators had to go in and restart the process to clear these connections, otherwise there were 256 hung connections and no new connections could be established.

Sometimes the other endpoint goes down, and doesn't come back. We will for instance deploy home gateways probably keeping netconf-call-home sessions to an NMS, and we want them to be around forever, as long as they work. TCP level keepalives would solve this, as if the customer just powers off the device, after a while the session will be cleared. Using TCP keepalives here means you get this kind of behaviour even if the upper-layer application doesn't support it (netconf might have been a bad example here). It's a single socket option to set, so it's very easy to do.
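As a minimal sketch of the "single socket option" point above, this is roughly what enabling TCP keepalives looks like in Python. The helper name `enable_keepalive` is illustrative only and does not come from the thread. With only this option set, the probe timers are whatever the operating system defaults to.

```python
import socket


def enable_keepalive(sock: socket.socket) -> None:
    """Turn on TCP keepalive probes for this socket, using the OS default timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


if __name__ == "__main__":
    # No connection is needed just to set the option.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```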

Agree. Looking at the transport layer, I think that allowing a flow to continue to use existing "network" state (in various forms) is an important aspect: there are NATs, firewalls, QoS classifiers, etc., as well as load balancers and layer 2/3 devices that make resource decisions at the flow level. Normally all of these do the correct thing when there is a continuous flow of packets.

Somewhere in the thread I also saw a statement suggesting that associations should be short-lived. If that advice is carried to the transport layer, I would expect it to have a serious impact on performance for some paths! (There are important trade-offs here, and we should not make sweeping assumptions.)

From knowing approximately what settings people have in their NAT44 and firewalls etc, I'd say the recommendation should be that keepalives are set to around a 60-300 second interval, and that the connection is killed if no traffic has passed in 3-5 of these intervals. Otherwise TCP will have backed off so far anyway that it's probably faster to just retry the connection instead of waiting for TCP to resend the packet.
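One possible way to map the 60-300 second interval and 3-5 interval give-up rule onto per-socket TCP keepalive timers is sketched below in Python. This is an assumption about how the suggestion could be applied, not something stated in the thread: the idle time and the probe interval are both set to the same value, and the probe count stands in for "3-5 intervals". The function names are hypothetical, and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are only set where the platform's `socket` module exposes them (Linux does).

```python
import socket


def detection_time_s(interval_s: int, probes: int) -> int:
    """Rough time from the last traffic until a dead peer is given up on:
    one idle interval before the first probe, then `probes` unanswered probes."""
    return interval_s + interval_s * probes


def configure_keepalive(sock: socket.socket, interval_s: int = 60, probes: int = 3) -> None:
    """Enable keepalives and, where the platform allows, set explicit timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    wanted = (
        ("TCP_KEEPIDLE", interval_s),   # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in wanted:
        opt = getattr(socket, name, None)  # not every platform defines these
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    configure_keepalive(s, interval_s=60, probes=3)
    print("approximate give-up time (s):", detection_time_s(60, 3))
    s.close()
```

Under this mapping, the two ends of the suggested range give up after roughly 240 seconds (60 second interval, 3 probes) and 1800 seconds (300 second interval, 5 probes).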

I have seen so many times in my 20 years working in networking where lack of keepalives has caused all kinds of problems. I wish everybody would turn them on and keep them on.

I agree. I have the feeling that this is not at all easy advice to get correct in a general way (and this thread is not quite there yet). E.g., RFC 5245 set lower limits for timers, because that was thought important.

I don't agree that protocol stacks with a secure transport protocol layer (e.g., TLS, SSH, DTLS) that sits on top of a cleartext protocol layer (e.g., TCP, UDP) should be advised to do the aliveness check only within the protection envelope afforded by the secure transport protocol layer. To me that seems entirely wrong. It has the same "issue" as above: it depends on the function of the aliveness check and the way this is used by the layer's protocol machine. In many cases it is absolutely desirable to do this within the layer that needs this information. Passing the detailed state down between layers can be most awkward. Higher layers can make their own decisions, and suppress keepalives or reaffirm state.

Guidance from the transport perspective on timers is in RFC 8085, Section 3.1.1; there is also more advice in the "behave" RFCs and a summary of the mechanisms in RFC 8085, Section 3.5 (noted by Lars). The vulnerabilities are also noted in RFC 8085, and I think we should be clear to differentiate between on-path and off-path knowledge when understanding this.

Gorry
