Re: openvswitch conntrack and nat problem in first packet reply with RST
you are correct! Thanks very much. It's works set a new example as following. ip,in_port=2 actions=ct(table=1,zone=1,nat) ip,in_port=3 actions=ct(table=1,zone=1,nat) table=1, ct_state=+new+trk,tcp,in_port=2,tp_dst=123 actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:3 table=1, ct_state=+new+trk,icmp,in_port=2 actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:3 table=1, ct_state=+new+trk,ip,in_port=3 actions=ct(commit,zone=1,nat(dst=192.168.0.7)),output:2 table=1, ct_state=+new+trk, priority=100, tcp,in_port=3,tp_dst=123 actions=drop table=1, ct_state=+est+trk,ip,in_port=3 actions=output:2 table=1, ct_state=+est+trk,ip,in_port=2 actions=output:3 > On 13 March 2017 at 20:18, wenxu <we...@ucloud.cn> wrote: >> Hi all, >> >> There is a simple test for conntrack and nat in openvswitch. I want to do >> stateful >> firewall with conntrack then do nat >> >> netns1 port1 with ip 10.0.0.7 >> netns2 port2 with ip 1.1.1.7 >> >> netns1 10.0.0.7 src -nat to 2.2.1.7 access netns2 1.1.1.7 >> >> 1. # ovs-ofctl add-flow br0 'ip,in_port=1 actions=ct(table=1,zone=1)' >> 2. # ovs-ofctl add-flow br0 'ip,in_port=2 actions=ct(table=1,zone=1)' >> 3. # ovs-ofctl add-flow br0 'table=1, >> ct_state=+new+trk,tcp,in_port=1,tp_dst=123 >> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2' >> 4. # ovs-ofctl add-flow br0 'table=1, ct_state=+est+trk,ip,in_port=2 >> actions=ct(commit,zone=1,nat(dst=10.0.0.7)),output:1' >> 5. # ovs-ofctl add-flow br0 'table=1, ct_state=+est+trk,ip,in_port=1 >> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2' >> >> >> I found that netns1 can access 1.1.1.7:123 when there is 123-port listen >> on 1.1.1.7 in netns2 >> >> But if there is no listen 123 port, The first RST packet reply by 1.1.1.7 >> (no datapath kernel rule) can't do dst-nat back to 10.0.0.7. The second RST >> packet is ok (there is datapath kernel rule which comes from first RST >> packet) >> >> # tcpdump -i eth0 -nnn >> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode >> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes >> 14:44:13.575200 IP 10.0.0.7.39891 > 1.1.1.7.123: Flags [S], seq 93585, >> win 29200, options [mss 1460,sackOK,TS val 584707316 ecr 0,nop,wscale 7], >> length 0 >> 14:44:13.576036 IP 1.1.1.7.123 > 2.2.1.7.39891: Flags [R.], seq 0, ack >> 93586, win 0, length 0 >> >> But the datapath flow is correct >> # ovs-dpctl dump-flows >> recirc_id(0),in_port(7),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, >> used:never, actions:ct(zone=1),recirc(0x5a) >> recirc_id(0x5a),in_port(7),ct_state(+new+trk),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=123), >> packets:0, bytes:0, used:never, >> actions:ct(commit,zone=1,nat(src=2.2.1.7)),8 >> recirc_id(0),in_port(8),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, >> used:never, actions:ct(zone=1),recirc(0x5b) >> recirc_id(0x5b),in_port(8),ct_state(-new+est+trk),eth_type(0x0800),ipv4(frag=no), >> packets:0, bytes:0, used:never, >> actions:ct(commit,zone=1,nat(dst=10.0.0.7)),7 >> >> >> I think It's a matter with the PACKET-OUT and RST packet >> >> There are two packet-out for rule2 and rul4. Rule2 go through connect track >> and find it is an RST packet then delete the conntrack . It leads the second >> packet(come from rule4) can't find the conntack to do dst-nat. >> >> In "netfilter/nf_conntrack_proto_tcp.c file >> if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) { >> /* If only reply is a RST, we can consider ourselves not to >>have an established connection: this is a fairly common >>problem case, so we can delete the conntrack >>immediately. --RR */ >> if (th->rst ) { >> nf_ct_kill_acct(ct, ctinfo, skb); >> return NF_ACCEPT; >> } >> } >> >> >> It should add a switch to avoid this conntrack be deleted. >> >> if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) { >> /* If only reply is a RST, we can consider ourselves not to >>have an established connection: this is a fairly common >>problem case, so we can delete the conntrack >>immediately. --RR */ >> -if (th->rst ) { >> +if (th->rst && !nf_ct_tcp_rst_no_kill) { >> nf_ct_kill_acct(ct, ctinfo, skb); >>
openvswitch conntrack and nat problem in first packet reply with RST
Hi all, There is a simple test for conntrack and nat in openvswitch. I want to do stateful firewall with conntrack then do nat netns1 port1 with ip 10.0.0.7 netns2 port2 with ip 1.1.1.7 netns1 10.0.0.7 src -nat to 2.2.1.7 access netns2 1.1.1.7 1. # ovs-ofctl add-flow br0 'ip,in_port=1 actions=ct(table=1,zone=1)' 2. # ovs-ofctl add-flow br0 'ip,in_port=2 actions=ct(table=1,zone=1)' 3. # ovs-ofctl add-flow br0 'table=1, ct_state=+new+trk,tcp,in_port=1,tp_dst=123 actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2' 4. # ovs-ofctl add-flow br0 'table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(commit,zone=1,nat(dst=10.0.0.7)),output:1' 5. # ovs-ofctl add-flow br0 'table=1, ct_state=+est+trk,ip,in_port=1 actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2' I found that netns1 can access 1.1.1.7:123 when there is 123-port listen on 1.1.1.7 in netns2 But if there is no listen 123 port, The first RST packet reply by 1.1.1.7 (no datapath kernel rule) can't do dst-nat back to 10.0.0.7. The second RST packet is ok (there is datapath kernel rule which comes from first RST packet) # tcpdump -i eth0 -nnn tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes 14:44:13.575200 IP 10.0.0.7.39891 > 1.1.1.7.123: Flags [S], seq 93585, win 29200, options [mss 1460,sackOK,TS val 584707316 ecr 0,nop,wscale 7], length 0 14:44:13.576036 IP 1.1.1.7.123 > 2.2.1.7.39891: Flags [R.], seq 0, ack 93586, win 0, length 0 But the datapath flow is correct # ovs-dpctl dump-flows recirc_id(0),in_port(7),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:ct(zone=1),recirc(0x5a) recirc_id(0x5a),in_port(7),ct_state(+new+trk),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=123), packets:0, bytes:0, used:never, actions:ct(commit,zone=1,nat(src=2.2.1.7)),8 recirc_id(0),in_port(8),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:ct(zone=1),recirc(0x5b) recirc_id(0x5b),in_port(8),ct_state(-new+est+trk),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:ct(commit,zone=1,nat(dst=10.0.0.7)),7 I think It's a matter with the PACKET-OUT and RST packet There are two packet-out for rule2 and rul4. Rule2 go through connect track and find it is an RST packet then delete the conntrack . It leads the second packet(come from rule4) can't find the conntack to do dst-nat. In "netfilter/nf_conntrack_proto_tcp.c file if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) { /* If only reply is a RST, we can consider ourselves not to have an established connection: this is a fairly common problem case, so we can delete the conntrack immediately. --RR */ if (th->rst ) { nf_ct_kill_acct(ct, ctinfo, skb); return NF_ACCEPT; } } It should add a switch to avoid this conntrack be deleted. if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) { /* If only reply is a RST, we can consider ourselves not to have an established connection: this is a fairly common problem case, so we can delete the conntrack immediately. --RR */ -if (th->rst ) { +if (th->rst && !nf_ct_tcp_rst_no_kill) { nf_ct_kill_acct(ct, ctinfo, skb); return NF_ACCEPT; } BR wenxu
Re: [PATCH] net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset
> In b8247f095e, > >"net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow > segmentation for local udp tunneled skbs" > > gso skbs arriving from an ingress interface that go through UDP > tunneling, are allowed to be fragmented if the resulting encapulated > segments exceed the dst mtu of the egress interface. > > This aligned the behavior of gso skbs to non-gso skbs going through udp > encapsulation path. > > However the non-gso vs gso anomaly is present also in the following > cases of a GRE tunnel: > - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set >(e.g. OvS vport-gre with df_default=false) > - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set > > In both of the above cases, the non-gso skbs get fragmented, whereas the > gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped, > as they don't go through the segment+fragment code path. > > Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set. > > Tunnels that do set IP_DF, will not go to fragmentation of segments. > This preserves behavior of ip_gre in (the default) pmtudisc mode. > > Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen > exceeds MTU, allow segmentation for local udp tunneled skbs") > Reported-by: wenxu <we...@ucloud.cn> Tested-by: wenxu <we...@ucloud.cn> > Cc: Hannes Frederic Sowa <han...@stressinduktion.org> > Signed-off-by: Shmulik Ladkani <shmulik.ladk...@gmail.com> > --- > > wenxu, can you please add a Tested-by? > > net/ipv4/ip_tunnel_core.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c > index 9d847c3025..0f227db0e9 100644 > --- a/net/ipv4/ip_tunnel_core.c > +++ b/net/ipv4/ip_tunnel_core.c > @@ -73,9 +73,11 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, > struct sk_buff *skb, > skb_dst_set(skb, >dst); > memset(IPCB(skb), 0, sizeof(*IPCB(skb))); > > - if (skb_iif && proto == IPPROTO_UDP) { > - /* Arrived from an ingress interface and got udp encapuslated. > - * The encapsulated network segment length may exceed dst mtu. > + if (skb_iif && !(df & htons(IP_DF))) { > + /* Arrived from an ingress interface, got encapsulated, with > + * fragmentation of encapulating frames allowed. > + * If skb is gso, the resulting encapsulated network segments > + * may exceed dst mtu. >* Allow IP Fragmentation of segments. >*/ > IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
Hi, > Hi, > > On Fri, 12 Aug 2016 13:11:50 +0200, han...@stressinduktion.org wrote: >> I really would not like to see this expanded to gre and other protocols. >> All switches drop packets where the packets are exceeding the MTU, >> bridges and also openvswitch should behave the same. >> >> Unfortunately we already had this loophole in the kernel that vxlan udp >> output path could fragment the packet again, even in case of switches. >> But this stopped working for GSO packets, which violates another rule in >> the kernel, GSO should always be transparent and user space should never >> have to care if a packet is GSO or not. >> >> Because we couldn't a) roll back the change that we fragment packets in >> UDP output paths and b) should not violate GSO transparency rule, I >> strongly believed it would be better too only change the kernel in a way >> that it transparently works with GSO, too. If we argue that a VTEP is >> its own UDP endpoint which is set up after the bridge, I still can sleep >> well. :) >> >> My understanding was that GRE failed consistently, GSO as well as >> non-GSO packets are dropped, which would be the correct behavior for me. >> I don't want to change this. A good argument against this would be if we >> violate the GSO transparency rule again. But when I looked into the code >> I couldn't see that. > I completely agree with your arguments. > > I think we may run into the same GSO vs Non-GSO anomaly if one uses > a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the > encapsulating iphdr 'df' is derived from 'tun_flags_DONT_FRAGMENT' > (e.g. in case DF is not set). > > I suspect OvS's vport-gre does exactly that, so I assume this is the > reason why the change was suggested. > > Maybe we can change our criteria in the following manner: > > - if (skb_iif && proto == IPPROTO_UDP) { > + if (skb_iif && !(df & htons(IP_DF))) { > IPCB(skb)->flags |= IPSKB_FRAG_SEGS; > > This way, any tunnel explicitly providing DF is NOT allowed to > further fragment the resulting segments (leading to tx segments being > dropped). > Others tunnels, that do not care (e.g. vxlan and geneve, and probably > ovs vport-gre, or other ovs encap vports, in df_default=false mode), > will behave same for gso and non-gso. > > WDYT? Am I missing something here? > > Thanks, > Shmulik I think the criteria 'skb_iif && !(df & htons(IP_DF)' is suitable. 'nopmtudisc' tunnel and ovs-gre tunnel can clear the DF through df_default=false. In this situation both gso(final segment) or no-gso packet can be fragment(if the packet size is more than mtu). Thanks Wenxu
Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
> On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <da...@davemloft.net> > wrote: >>> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen >>> exceeds MTU, allow segmentation for local udp tunneled skbs") >>> >>> Given: >>> - tap0 and ovs-gre >>> - ovs-gre stacked on eth0, eth0 having the small mtu >>> >>> After encapsulation these skbs have skb_gso_network_seglen that exceed >>> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than >>> eth0 mtu. These packets maybe dropped. >>> >>> It has the same problem if tap0 bridge with ipgre or gretap device. So >>> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs. >>> >>> Signed-off-by: wenxu <we...@ucloud.cn> >> I am rather certain that this test is intentionally restricted to >> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe. >> >> Hannes and Shmulik? > It was restricted to UDP tun encaps per Hannes' suggestion, in order to > scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1]. > > Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to > handle PMTU, but - if I understand correctly - only in the !skb_is_gso > case. > (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE") > > This is probably due the same hidden assumption that 'gso_size' will be > suitable for the egress mtu even after encapsulation, which does not > hold true in some usecases as described in [2]. > > Therefore it might be that b8247f095edd needs to be augmented for > non-udp tunnels as well. > > Hannes? > > [1] http://www.spinics.net/lists/netdev/msg386447.html > [2] http://www.spinics.net/lists/netdev/msg385776.html I think non-udp tunnels should also be set And in b8247f095edd, condition skb_iif also should be removed given: ovs-internal-port bridge with ovs-gre There is the same problem that packets from local.
[PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs
From: wenxu <we...@ucloud.cn> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs") Given: - tap0 and ovs-gre - ovs-gre stacked on eth0, eth0 having the small mtu After encapsulation these skbs have skb_gso_network_seglen that exceed eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than eth0 mtu. These packets maybe dropped. It has the same problem if tap0 bridge with ipgre or gretap device. So the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs. Signed-off-by: wenxu <we...@ucloud.cn> --- net/ipv4/ip_tunnel_core.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index 9d847c3..7f36ea2 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -73,8 +73,9 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb, skb_dst_set(skb, >dst); memset(IPCB(skb), 0, sizeof(*IPCB(skb))); - if (skb_iif && proto == IPPROTO_UDP) { - /* Arrived from an ingress interface and got udp encapuslated. + if (skb_iif && (proto == IPPROTO_UDP || proto == IPPROTO_GRE)) { + /* Arrived from an ingress interface and got udp or gre +* encapsulated. * The encapsulated network segment length may exceed dst mtu. * Allow IP Fragmentation of segments. */ -- 1.8.3.1
[PATCH] [stable 4.1.y PACTH] openvswitch: fix crash cause by non-nvgre packet
kernel BUG at include/linux/skbuff.h:1219! invalid opcode: [#1] SMP RIP: 0010:[] ovs_flow_extract+0x8ed/0xa40 [openvswitch] Call Trace: ovs_dp_process_received_packet+0x44/0x80 [openvswitch] ovs_vport_receive+0x2e/0x30 [openvswitch] gre_rcv+0xac/0xd0 [openvswitch] gre_cisco_rcv+0x1c2/0x310 [openvswitch] gre_rcv+0x59/0x80 [openvswitch] ovs_flow_extract call __skb_pull to lead BUG_ON(skb->len < skb->data_len) if the gre header protocol is not TEB and most part of the packet is in the nolinear-spatial. 1. gre_rcv: pskb_may_pull(skb, 12) pull the 12 bytes to linear-spatial(skb->data). The gre header is 8 bytes only with key. 2. gre_cisco_rcv-->parse_gre_header-->iptunnel_pull_header { if (inner_proto == htons(ETH_P_TEB)) { struct ethhdr *eh; if (unlikely(!pskb_may_pull(skb, ETH_HLEN))) return -ENOMEM; .. } } The wrong inner_proto leads no pull the Mac header to linear-spatial 3. finally It made a crash in ovs_flow_extract->__skb_pull Signed-off-by: wenxu <we...@ucloud.cn> --- net/openvswitch/vport-gre.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c index f17ac96..4a993b5 100644 --- a/net/openvswitch/vport-gre.c +++ b/net/openvswitch/vport-gre.c @@ -107,6 +107,9 @@ static int gre_rcv(struct sk_buff *skb, if (unlikely(!vport)) return PACKET_REJECT; + if (unlikely(tpi->proto != htons(ETH_P_TEB))) + return PACKET_REJECT; + key = key_to_tunnel_id(tpi->key, tpi->seq); ovs_flow_tun_info_init(_info, ip_hdr(skb), 0, 0, key, filter_tnl_flags(tpi->flags), NULL, 0); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html