Hi John,
Let's assume that your network deployment (for the IPsec tunnel) is as below. The values shown are the MTUs of the connected interfaces:

[appliance1]1500mtu-----1500mtu[openwrt-router1]1500mtu-------[internet]-------1500mtu[openwrt-router2]1500mtu-----1500mtu[appliance2]

In this case the xfrm-based IPsec tunnel is between router1 and router2. The points below should help explain what is happening and why the iptables mangle rule with the TCPMSS target is used for MSS clamping in each direction of the TCP connection.

1. You mentioned that the TCP traffic between the appliances flowing through the IPsec tunnel uses large packet sizes.
a) This means the MSS negotiated between appliance1 and appliance2 will always be set to 1460 bytes (1500 - 40 = 1460).
   Note: the 40 bytes are the 20-byte TCP header plus the 20-byte IP header.

2. Another point to note is that hosts/gateways/routers have PMTUD (path MTU discovery) enabled by default, so every TCP/UDP connection initiated from them will have the DF bit set.
- To disable PMTUD, set "/proc/sys/net/ipv4/ip_no_pmtu_disc" to 1:
  echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc
  This ensures that TCP/UDP packets are NOT sent with the DF bit set.

3. Regarding the IPsec tunnel (using xfrm interfaces) established on each router (router1/router2): once the tunnel is established, an IPsec-SA MTU is invariably set for the outbound SA. Its value is derived from the MTU of the outbound (WAN) interface, 1500 in this case, minus the overhead of the encryption algorithm used (say AES256, for example).
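The 1460 figure in point 1 can be spelled out as a quick sketch (assuming a 20-byte IPv4 header and a 20-byte TCP header, neither with options):

```shell
# Default MSS a host derives from its link MTU:
LINK_MTU=1500
IP_HDR=20    # IPv4 header, no options
TCP_HDR=20   # TCP header, no options
MSS=$((LINK_MTU - IP_HDR - TCP_HDR))
echo "default negotiated MSS: $MSS bytes"   # 1460
```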
a) I am not entirely sure where exactly the IPsec-SA MTU set for a tunnel (with a specific algorithm) can be inspected, but from past recollection I would say that for the AES256 algorithm the IPsec-SA MTU would be approximately 1422 (1500 minus all the ESP/encryption overhead) for all outbound ESP packets.
b) If appliance1 were a host following PMTUD standards: when it sends a TCP/UDP packet (with the DF bit set) of, say, 1500 bytes, and this packet arrives at router1, matches the IPsec tunnel policy, and needs to be forwarded through the tunnel to router2, then before encryption a check is done against the IPsec-SA MTU of the tunnel, which would be 1422.
- In this case router1 would send an ICMP unreachable message (type 3 / code 4, "fragmentation needed and DF set") with an MTU value of 1422 to appliance1.
- If appliance1 followed the standards, this ICMP message would cause it to cache the path MTU and reduce the segment size of the TCP connection to 1422 - 40 = 1382 bytes.
- The same process is expected to happen from the other end, where the appliance2 TCP host is connected.
- This ensures the TCP data connection uses a maximum packet size of 1382 bytes of TCP payload + 40 = 1422 bytes, avoiding fragmentation at the IPsec tunnel in the outbound direction.

Note: For UDP connections, if appliance1 followed the standards, the ICMP "fragmentation needed" message would cause appliance1 itself to fragment a large packet into two fragments, each no larger than 1422 bytes, so again there is NO fragmentation at the IPsec tunnel in the outbound direction.

4.
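The arithmetic behind the PMTUD outcome above, as a sketch (the 1422 IPsec-SA MTU is my estimate for AES256 from the text; the 40 bytes are the assumed 20-byte IP + 20-byte TCP headers):

```shell
# What a standards-following host would do after receiving the
# ICMP "fragmentation needed, MTU 1422" message:
SA_MTU=1422
IP_TCP_HDRS=40
CLAMPED=$((SA_MTU - IP_TCP_HDRS))
echo "segment size after PMTUD: $CLAMPED bytes"          # 1382
echo "max on-wire TCP packet: $((CLAMPED + IP_TCP_HDRS)) bytes"  # 1422
```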
So in your case, since both appliances are misbehaving and not following the standards - ignoring the PMTU ICMP messages while of course still sending traffic with the DF bit set:

a) You have correctly applied one of the solutions to avoid fragmentation for TCP connections: MSS clamping in both directions, applied during the TCP handshake negotiation (the TCP control connection):

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o xfrm0 -j TCPMSS --set-mss 1240
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -i xfrm0 -j TCPMSS --set-mss 1240

b) I believe you have applied the above only on router1. You could also apply the same on router2, if you can.
c) What the MSS clamping does: the MSS value in the outgoing TCP SYN packet from appliance1 to appliance2 is rewritten to 1240. This informs appliance2 that appliance1 can only process TCP segments of at most 1240 bytes, so appliance2 will only ever send TCP packets of at most 1240 + 40 = 1280 bytes. The same happens in the other direction, so appliance1 also sends TCP data packets of at most 1280 bytes (1240 + 40).

Note: MSS clamping is often applied in POSTROUTING instead, but if the above works in FORWARD, do continue with it. Keep in mind that the "-i" (input interface) match is not valid in the POSTROUTING chain, so only the outbound rule could be moved there:
#iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o xfrm0 -j TCPMSS --set-mss 1240

d) Some things to check after applying the MSS clamping:
- Capture the TCP session packets flowing between appliance1 and router1's LAN interface, and
- check whether the MSS is actually negotiated down to 1240, or whether both appliances continue to set their MSS to 1460, ignoring the clamping.
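FYI, as an alternative to a fixed value, the TCPMSS target also supports clamping to the discovered path MTU. A sketch using the same chain and interface as your rules (only the outbound direction shown; the clamp uses the PMTU of the output route):

```shell
# Clamp outgoing SYNs to (path MTU - 40) instead of a fixed 1240:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o xfrm0 \
    -j TCPMSS --clamp-mss-to-pmtu
```

With a fixed value you know exactly what the appliances will see; with --clamp-mss-to-pmtu the value tracks whatever MTU the kernel has cached for the route, which may or may not reflect the IPsec-SA MTU in your setup.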
Note: One point about the clamping value: an MSS of 1240 means the IP/TCP packets generated by appliance1/appliance2 will be 1240 + 40 = 1280 bytes, which is comfortably below both the WAN interface MTU of 1500 and the IPsec-SA MTU of about 1422 (if it is set/used at all, and if AES256 is the algorithm). This should ideally result in NO fragmentation at all.

e) BUT if, in spite of the MSS clamping, the appliances continue to send TCP packets with an MSS of 1460 AND the DF bit set, then there will be fragmentation - at least post-IPsec ESP fragmentation in the outbound direction on each of the IPsec peer routers.

5. Another question to consider: what about the UDP traffic generated by the appliances? Are they generating large, non-fragmented packets of 1500 bytes each, with the DF bit always set? For UDP there is no clamping that can be done; the final alternative, which applies to both TCP and UDP traffic, is to clear the DF bit in all the TCP/UDP packets generated by the appliances.
a) This can most probably be done by disabling PMTU discovery on both appliances:
   echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc
   This ensures that TCP/UDP packets are NOT sent with the DF bit set.
b) The above should be possible if the appliances are Linux/Unix systems.

6.
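To confirm on the wire whether the appliances really do set the DF bit on their UDP traffic, a capture filter on the DF flag can help. A sketch, run on router1's LAN interface (the interface name is assumed; the appliance IP is taken from your captures):

```shell
# Match IPv4 packets with the DF bit set (bit 0x40 in byte 6 of the
# IP header) coming from appliance1:
tcpdump -ni eth0 'src host 10.1.34.10 and ip[6] & 0x40 != 0'
```

If this shows large UDP packets with DF set and the appliances cannot be changed, the only remaining knob is on the routers (see the copy_df discussion below in point 6).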
Now coming to another important point you asked about: how to clear the DF bit of the inbound plain TCP/UDP traffic before it is encrypted into the IPsec tunnel.

a) As such you cannot do this on router1, but you should be able to on the two appliances, as mentioned in point 5 above.
b) But FYI, for IPsec tunnels the RFC standard requires every implementation to support the following behaviours during encryption with ESP:
i) copy the DF bit (if set) from the inner IP header (of the plain TCP/UDP packet) to the outer IP header of the ESP packet generated by the router/gateway;
ii) if the DF bit is set in the inner IP header before encryption, optionally clear the DF bit in the outer header, if implemented in the local IPsec engine;
iii) if the DF bit is NOT set in the inner IP header before encryption, optionally set the DF bit in the outer IP header of the ESP packet.
- Generally it is the "copy DF bit from inner IP header to outer IP header of the ESP packet" behaviour that is always implemented as a MUST (per the RFC requirements).
- BUT note that none of this clears the DF bit of the incoming plain TCP/UDP packets themselves before encryption; it only affects the outer ESP header.
c) So FYI, since you are using xfrm interfaces with strongswan IKEv2 and specifically swanctl.conf, you may try the following setting to stop copying the DF bit into the outer IP header of the ESP packets:

connections.<conn>.children.<child>.copy_df (since 5.7.0, default: yes)
    Whether to copy the DF bit to the outer IPv4 header in tunnel mode.

Set it to:

connections.<conn>.children.<child>.copy_df = no

Hope the above info helps somewhat.

Thanks & regards
Rajiv

On Wed, Dec 15, 2021 at 7:35 AM Noel Kuntze
<noel.kuntze+strongswan-users-ml@thermi.consulting> wrote:
> Hello John,
>
> I am not aware of if the kernel tracks the assigned TCP MSS of the
> connections it knows of.
> Conntrack does not have that information. So it's a good question why
> exactly that happens.
>
> Can you double check if there is not maybe something like a local proxy
> running that could
> be the cause of that? Also, what is the currently set MTU on the interface?
> Does it coincide with the MSS (taking the TCP overhead into account)?
>
> I agree that it is likely extremely fragile. A good way would be a
> userspace proxy, like squid.
> Squid knows about conntrack, so can transparently proxy connections, even
> without tproxy (speaking from memories).
>
> Kind regards
> Noel
>
>
> Am 03.12.21 um 15:35 schrieb John Marrett:
> > I am working on a VPN solution connecting some appliances on two
> > different networks. I’m using an x86 openwrt router with strongswan
> > 5.9.2 and kernel 5.4.154. The systems I am connecting exhibit
> > non-compliant TCP MSS behaviour. They are, for unknown reasons,
> > ignoring the MSS from their peers and sending oversized packets. They
> > also ignore ICMP unreachable messages indicating path MTU, I have
> > confirmed that the ICMP unreachable messages are not blocked and they
> > have been captured directly on the system sending the problematic
> > traffic. I do not have control over the appliances and need to solve
> > the issues at the network level.
> >
> > I'm using a modern IKEv2 / XFRM based configuration for this VPN. I
> > would like to ignore the DF bit and fragment traffic passing through
> > the VPN tunnel. This fragmentation could occur before or after
> > encapsulation, it's not significant to me.
> >
> > If I was using a GRE tunnel I could use the ignore-df configuration
> > [1], however there doesn't appear to be an equivalent with an xfrm
> > interface.
> >
> > I have managed to "solve" my problem, though I do not understand the
> > solution or how it works.
> > If I create the following iptables rule to
> > adjust the MSS on traffic traversing the xfrm interface:
> >
> > iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o xfrm0
> > -j TCPMSS --set-mss 1240
> > iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -i xfrm0
> > -j TCPMSS --set-mss 1240
> >
> > Then, in addition to the expected modification of the mss field, my
> > TCP traffic will be fragmented, ignoring the DF bit.
> >
> > Here's an excerpt of traffic in ingress to the router:
> >
> > 09:23:56.103022 IP 10.1.34.10.5060 > 10.1.61.20.25578: Flags [P.], seq
> > 883:1906, ack 1760, win 260, length 1023
> > 09:23:56.119864 IP 10.1.61.20.25578 > 10.1.34.10.5060: Flags [.], ack
> > 1906, win 501, length 0
> > 09:24:01.448960 IP 10.1.34.10.5060 > 10.1.61.20.25578: Flags [P.], seq
> > 1906:3271, ack 1760, win 260, length 1365
> > 09:24:01.467771 IP 10.1.61.20.25578 > 10.1.34.10.5060: Flags [.], ack
> > 3148, win 501, length 0
> > 09:24:01.467810 IP 10.1.61.20.25578 > 10.1.34.10.5060: Flags [.], ack
> > 3271, win 501, length 0
> >
> > And egress on the xfrm interface (In addition to being sent over a VPN
> > connect the traffic is also being NATed by the VPN router):
> >
> > 09:23:56.103150 IP 10.2.30.1.5060 > 10.2.2.6.25578: Flags [P.], seq
> > 881:1902, ack 1750, win 260, length 1021
> > 09:23:56.119828 IP 10.2.2.6.25578 > 10.2.30.1.5060: Flags [.], ack
> > 1902, win 501, length 0
> > 09:24:01.449067 IP 10.2.30.1.5060 > 10.2.2.6.25578: Flags [.], seq
> > 1902:3142, ack 1750, win 260, length 1240
> > 09:24:01.449135 IP 10.2.30.1.5060 > 10.2.2.6.25578: Flags [P.], seq
> > 3142:3265, ack 1750, win 260, length 123
> > 09:24:01.467724 IP 10.2.2.6.25578 > 10.2.30.1.5060: Flags [.], ack
> > 3142, win 501, length 0
> > 09:24:01.467725 IP 10.2.2.6.25578 > 10.2.30.1.5060: Flags [.], ack
> > 3265, win 501, length 0
> >
> > The packet with length 1365 has been split into a packet of 1240 bytes
> > and a second of 123.
> >
> > Without these rules I see the expected behaviour, the packets are
> > dropped and ICMP unreachable messages are sent indicating the path
> > MTU.
> >
> > Is anyone able to explain why, in addition to adjusting the MSS, this
> > mangle configuration is allowing fragmentation ignoring the DF bit?
> > While the solution is working as I need it to, I'm concerned that it
> > may be extremely fragile.
> >
> > Is there a better way to solve this problem?
> >
> > Thanks in advance for any help you can offer,
> >
> > -JohnF
> >
> > [1] https://man7.org/linux/man-pages/man8/ip-tunnel.8.html
>