Re: openvswitch conntrack and nat problem in first packet reply with RST

2017-03-14 Thread wenxu

you are correct! Thanks very much.

It's works  set a new example as following.

ip,in_port=2 actions=ct(table=1,zone=1,nat)
ip,in_port=3 actions=ct(table=1,zone=1,nat)

table=1, ct_state=+new+trk,tcp,in_port=2,tp_dst=123 
actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:3
table=1, ct_state=+new+trk,icmp,in_port=2 
actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:3
table=1, ct_state=+new+trk,ip,in_port=3 
actions=ct(commit,zone=1,nat(dst=192.168.0.7)),output:2
table=1, ct_state=+new+trk, priority=100, tcp,in_port=3,tp_dst=123 actions=drop
table=1, ct_state=+est+trk,ip,in_port=3 actions=output:2
table=1, ct_state=+est+trk,ip,in_port=2 actions=output:3





> On 13 March 2017 at 20:18, wenxu <we...@ucloud.cn> wrote:
>> Hi all,
>>
>> There is a simple test for conntrack and nat in openvswitch.  I want to do 
>> stateful
>> firewall with conntrack then do nat
>>
>> netns1 port1 with ip 10.0.0.7
>> netns2 port2 with ip 1.1.1.7
>>
>> netns1 10.0.0.7 src -nat to 2.2.1.7 access netns2 1.1.1.7
>>
>> 1. # ovs-ofctl add-flow br0  'ip,in_port=1 actions=ct(table=1,zone=1)'
>> 2. # ovs-ofctl add-flow br0  'ip,in_port=2 actions=ct(table=1,zone=1)'
>> 3. # ovs-ofctl add-flow br0  'table=1, 
>> ct_state=+new+trk,tcp,in_port=1,tp_dst=123 
>> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
>> 4. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=2 
>> actions=ct(commit,zone=1,nat(dst=10.0.0.7)),output:1'
>> 5. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=1  
>> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
>>
>>
>> I  found that  netns1 can access 1.1.1.7:123  when there is 123-port listen 
>> on 1.1.1.7  in netns2
>>
>> But if there is no listen 123 port, The first RST packet reply by 1.1.1.7
>> (no datapath kernel rule) can't do dst-nat back to 10.0.0.7.  The second RST 
>> packet is ok (there is datapath kernel rule which comes from first RST 
>> packet)
>>
>> # tcpdump -i eth0 -nnn
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
>> 14:44:13.575200 IP 10.0.0.7.39891 > 1.1.1.7.123: Flags [S], seq 93585, 
>> win 29200, options [mss 1460,sackOK,TS val 584707316 ecr 0,nop,wscale 7], 
>> length 0
>> 14:44:13.576036 IP 1.1.1.7.123 > 2.2.1.7.39891: Flags [R.], seq 0, ack 
>> 93586, win 0, length 0
>>
>> But the datapath flow is correct
>> # ovs-dpctl dump-flows
>> recirc_id(0),in_port(7),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
>> used:never, actions:ct(zone=1),recirc(0x5a)
>> recirc_id(0x5a),in_port(7),ct_state(+new+trk),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=123),
>>  packets:0, bytes:0, used:never,
>> actions:ct(commit,zone=1,nat(src=2.2.1.7)),8
>> recirc_id(0),in_port(8),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
>> used:never, actions:ct(zone=1),recirc(0x5b)
>> recirc_id(0x5b),in_port(8),ct_state(-new+est+trk),eth_type(0x0800),ipv4(frag=no),
>>  packets:0, bytes:0, used:never,
>> actions:ct(commit,zone=1,nat(dst=10.0.0.7)),7
>>
>>
>> I think It's a matter with the PACKET-OUT and RST packet
>>
>> There are two packet-out for rule2 and rul4. Rule2 go through connect track 
>> and find it is an RST packet then delete the conntrack . It leads the second 
>> packet(come from rule4) can't find the conntack to do dst-nat.
>>
>> In "netfilter/nf_conntrack_proto_tcp.c file
>>  if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) {
>> /* If only reply is a RST, we can consider ourselves not to
>>have an established connection: this is a fairly common
>>problem case, so we can delete the conntrack
>>immediately.  --RR */
>> if (th->rst ) {
>> nf_ct_kill_acct(ct, ctinfo, skb);
>> return NF_ACCEPT;
>> }
>> }
>>
>>
>> It should add a switch to avoid this conntrack  be deleted.
>>
>> if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) {
>> /* If only reply is a RST, we can consider ourselves not to
>>have an established connection: this is a fairly common
>>problem case, so we can delete the conntrack
>>immediately.  --RR */
>> -if (th->rst ) {
>> +if (th->rst && !nf_ct_tcp_rst_no_kill) {
>> nf_ct_kill_acct(ct, ctinfo, skb);
>>

openvswitch conntrack and nat problem in first packet reply with RST

2017-03-13 Thread wenxu
Hi all,

There is a simple test for conntrack and nat in openvswitch.  I want to do 
stateful
firewall with conntrack then do nat

netns1 port1 with ip 10.0.0.7
netns2 port2 with ip 1.1.1.7

netns1 10.0.0.7 src -nat to 2.2.1.7 access netns2 1.1.1.7

1. # ovs-ofctl add-flow br0  'ip,in_port=1 actions=ct(table=1,zone=1)'
2. # ovs-ofctl add-flow br0  'ip,in_port=2 actions=ct(table=1,zone=1)'
3. # ovs-ofctl add-flow br0  'table=1, 
ct_state=+new+trk,tcp,in_port=1,tp_dst=123 
actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
4. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=2 
actions=ct(commit,zone=1,nat(dst=10.0.0.7)),output:1'
5. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=1  
actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'


I  found that  netns1 can access 1.1.1.7:123  when there is 123-port listen on 
1.1.1.7  in netns2

But if there is no listen 123 port, The first RST packet reply by 1.1.1.7
(no datapath kernel rule) can't do dst-nat back to 10.0.0.7.  The second RST 
packet is ok (there is datapath kernel rule which comes from first RST packet)

# tcpdump -i eth0 -nnn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
14:44:13.575200 IP 10.0.0.7.39891 > 1.1.1.7.123: Flags [S], seq 93585, win 
29200, options [mss 1460,sackOK,TS val 584707316 ecr 0,nop,wscale 7], length 0
14:44:13.576036 IP 1.1.1.7.123 > 2.2.1.7.39891: Flags [R.], seq 0, ack 
93586, win 0, length 0

But the datapath flow is correct
# ovs-dpctl dump-flows
recirc_id(0),in_port(7),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
used:never, actions:ct(zone=1),recirc(0x5a)
recirc_id(0x5a),in_port(7),ct_state(+new+trk),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=123),
 packets:0, bytes:0, used:never,
actions:ct(commit,zone=1,nat(src=2.2.1.7)),8
recirc_id(0),in_port(8),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
used:never, actions:ct(zone=1),recirc(0x5b)
recirc_id(0x5b),in_port(8),ct_state(-new+est+trk),eth_type(0x0800),ipv4(frag=no),
 packets:0, bytes:0, used:never,
actions:ct(commit,zone=1,nat(dst=10.0.0.7)),7


I think It's a matter with the PACKET-OUT and RST packet

There are two packet-out for rule2 and rul4. Rule2 go through connect track and 
find it is an RST packet then delete the conntrack . It leads the second 
packet(come from rule4) can't find the conntack to do dst-nat.

In "netfilter/nf_conntrack_proto_tcp.c file
 if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) {
/* If only reply is a RST, we can consider ourselves not to
   have an established connection: this is a fairly common
   problem case, so we can delete the conntrack
   immediately.  --RR */
if (th->rst ) {
nf_ct_kill_acct(ct, ctinfo, skb);
return NF_ACCEPT;
}
}


It should add a switch to avoid this conntrack  be deleted.

if (!test_bit(IPS_SEEN_REPLY_BIT, >status)) {
/* If only reply is a RST, we can consider ourselves not to
   have an established connection: this is a fairly common
   problem case, so we can delete the conntrack
   immediately.  --RR */
-if (th->rst ) {
+if (th->rst && !nf_ct_tcp_rst_no_kill) {
nf_ct_kill_acct(ct, ctinfo, skb);
return NF_ACCEPT;
}


BR
wenxu





Re: [PATCH] net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset

2016-08-21 Thread wenxu
> In b8247f095e,
>
>"net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow 
> segmentation for local udp tunneled skbs"
>
> gso skbs arriving from an ingress interface that go through UDP
> tunneling, are allowed to be fragmented if the resulting encapulated
> segments exceed the dst mtu of the egress interface.
>
> This aligned the behavior of gso skbs to non-gso skbs going through udp
> encapsulation path.
>
> However the non-gso vs gso anomaly is present also in the following
> cases of a GRE tunnel:
>  - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
>(e.g. OvS vport-gre with df_default=false)
>  - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
>
> In both of the above cases, the non-gso skbs get fragmented, whereas the
> gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
> as they don't go through the segment+fragment code path.
>
> Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.
>
> Tunnels that do set IP_DF, will not go to fragmentation of segments.
> This preserves behavior of ip_gre in (the default) pmtudisc mode.
>
> Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen 
> exceeds MTU, allow segmentation for local udp tunneled skbs")
> Reported-by: wenxu <we...@ucloud.cn>
Tested-by: wenxu <we...@ucloud.cn>
> Cc: Hannes Frederic Sowa <han...@stressinduktion.org>
> Signed-off-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
> ---
>  
>  wenxu, can you please add a Tested-by?
>
>  net/ipv4/ip_tunnel_core.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index 9d847c3025..0f227db0e9 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -73,9 +73,11 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, 
> struct sk_buff *skb,
>   skb_dst_set(skb, >dst);
>   memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
>  
> - if (skb_iif && proto == IPPROTO_UDP) {
> - /* Arrived from an ingress interface and got udp encapuslated.
> -  * The encapsulated network segment length may exceed dst mtu.
> + if (skb_iif && !(df & htons(IP_DF))) {
> + /* Arrived from an ingress interface, got encapsulated, with
> +  * fragmentation of encapulating frames allowed.
> +  * If skb is gso, the resulting encapsulated network segments
> +  * may exceed dst mtu.
>* Allow IP Fragmentation of segments.
>*/
>   IPCB(skb)->flags |= IPSKB_FRAG_SEGS;




Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs

2016-08-16 Thread wenxu
Hi,

> Hi,
>
> On Fri, 12 Aug 2016 13:11:50 +0200, han...@stressinduktion.org wrote:
>> I really would not like to see this expanded to gre and other protocols.
>> All switches drop packets where the packets are exceeding the MTU,
>> bridges and also openvswitch should behave the same.
>>
>> Unfortunately we already had this loophole in the kernel that vxlan udp
>> output path could fragment the packet again, even in case of switches.
>> But this stopped working for GSO packets, which violates another rule in
>> the kernel, GSO should always be transparent and user space should never
>> have to care if a packet is GSO or not.
>>
>> Because we couldn't a) roll back the change that we fragment packets in
>> UDP output paths and b) should not violate GSO transparency rule, I
>> strongly believed it would be better too only change the kernel in a way
>> that it transparently works with GSO, too. If we argue that a VTEP is
>> its own UDP endpoint which is set up after the bridge, I still can sleep
>> well. :)
>>
>> My understanding was that GRE failed consistently, GSO as well as
>> non-GSO packets are dropped, which would be the correct behavior for me.
>> I don't want to change this. A good argument against this would be if we
>> violate the GSO transparency rule again. But when I looked into the code
>> I couldn't see that.
> I completely agree with your arguments.
>
> I think we may run into the same GSO vs Non-GSO anomaly if one uses
> a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the
> encapsulating iphdr 'df' is derived from 'tun_flags_DONT_FRAGMENT'
> (e.g. in case DF is not set).
>
> I suspect OvS's vport-gre does exactly that, so I assume this is the
> reason why the change was suggested.
>
> Maybe we can change our criteria in the following manner:
>  
> - if (skb_iif && proto == IPPROTO_UDP) {
> + if (skb_iif && !(df & htons(IP_DF))) {
>   IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
>
> This way, any tunnel explicitly providing DF is NOT allowed to
> further fragment the resulting segments (leading to tx segments being
> dropped).
> Others tunnels, that do not care (e.g. vxlan and geneve, and probably
> ovs vport-gre, or other ovs encap vports, in df_default=false mode),
> will behave same for gso and non-gso.
>
> WDYT? Am I missing something here?
>
> Thanks,
> Shmulik

I think the criteria 'skb_iif && !(df & htons(IP_DF)' is suitable.   
'nopmtudisc' tunnel and ovs-gre tunnel can clear the DF through 
df_default=false.
In this situation both gso(final segment) or no-gso packet can be fragment(if 
the packet size is more than mtu).

Thanks
Wenxu




Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs

2016-08-11 Thread wenxu
> On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <da...@davemloft.net> 
> wrote:
>>> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
>>> exceeds MTU, allow segmentation for local udp tunneled skbs")
>>>
>>> Given:
>>>  - tap0 and ovs-gre
>>>  - ovs-gre stacked on eth0, eth0 having the small mtu
>>>
>>> After encapsulation these skbs have skb_gso_network_seglen that exceed
>>> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
>>> eth0 mtu. These packets maybe dropped.
>>>
>>> It has the same problem if tap0 bridge with ipgre or gretap device. So
>>> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.
>>>
>>> Signed-off-by: wenxu <we...@ucloud.cn>  
>> I am rather certain that this test is intentionally restricted to
>> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe.
>>
>> Hannes and Shmulik?
> It was restricted to UDP tun encaps per Hannes' suggestion, in order to
> scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1].
>
> Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to
> handle PMTU, but - if I understand correctly - only in the !skb_is_gso
> case.
> (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE")
>
> This is probably due the same hidden assumption that 'gso_size' will be
> suitable for the egress mtu even after encapsulation, which does not
> hold true in some usecases as described in [2].
>
> Therefore it might be that b8247f095edd needs to be augmented for
> non-udp tunnels as well.
>
> Hannes?
>
> [1] http://www.spinics.net/lists/netdev/msg386447.html
> [2] http://www.spinics.net/lists/netdev/msg385776.html
I think non-udp tunnels should also be set
And in b8247f095edd, condition skb_iif also should be removed
given:
ovs-internal-port bridge with ovs-gre

There is the same problem that packets from local.



[PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs

2016-08-09 Thread wenxu
From: wenxu <we...@ucloud.cn>

commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen
exceeds MTU, allow segmentation for local udp tunneled skbs")

Given:
 - tap0 and ovs-gre
 - ovs-gre stacked on eth0, eth0 having the small mtu

After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than
eth0 mtu. These packets maybe dropped.

It has the same problem if tap0 bridge with ipgre or gretap device. So
the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs.

Signed-off-by: wenxu <we...@ucloud.cn>
---
 net/ipv4/ip_tunnel_core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 9d847c3..7f36ea2 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -73,8 +73,9 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct 
sk_buff *skb,
skb_dst_set(skb, >dst);
memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
 
-   if (skb_iif && proto == IPPROTO_UDP) {
-   /* Arrived from an ingress interface and got udp encapuslated.
+   if (skb_iif && (proto == IPPROTO_UDP || proto == IPPROTO_GRE)) {
+   /* Arrived from an ingress interface and got udp or gre
+* encapsulated.
 * The encapsulated network segment length may exceed dst mtu.
 * Allow IP Fragmentation of segments.
 */
-- 
1.8.3.1




[PATCH] [stable 4.1.y PACTH] openvswitch: fix crash cause by non-nvgre packet

2015-12-22 Thread wenxu
kernel BUG at include/linux/skbuff.h:1219!
invalid opcode:  [#1] SMP
RIP: 0010:[] ovs_flow_extract+0x8ed/0xa40 [openvswitch]
Call Trace:

ovs_dp_process_received_packet+0x44/0x80 [openvswitch]
ovs_vport_receive+0x2e/0x30 [openvswitch]
gre_rcv+0xac/0xd0 [openvswitch]
gre_cisco_rcv+0x1c2/0x310 [openvswitch]
gre_rcv+0x59/0x80 [openvswitch]

ovs_flow_extract call __skb_pull to lead BUG_ON(skb->len < skb->data_len)
if the gre header protocol is not TEB and most part of the packet is in
the nolinear-spatial.

1. gre_rcv: pskb_may_pull(skb, 12)
pull the 12 bytes to linear-spatial(skb->data). The gre header is 8 bytes
only with key.

2. gre_cisco_rcv-->parse_gre_header-->iptunnel_pull_header
{
if (inner_proto == htons(ETH_P_TEB)) {
struct ethhdr *eh;

if (unlikely(!pskb_may_pull(skb, ETH_HLEN)))
return -ENOMEM;
..
}
}
The wrong inner_proto leads no pull the Mac header to linear-spatial

3. finally It made a crash in ovs_flow_extract->__skb_pull

Signed-off-by: wenxu <we...@ucloud.cn>
---
 net/openvswitch/vport-gre.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index f17ac96..4a993b5 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -107,6 +107,9 @@ static int gre_rcv(struct sk_buff *skb,
if (unlikely(!vport))
return PACKET_REJECT;
 
+   if (unlikely(tpi->proto != htons(ETH_P_TEB)))
+   return PACKET_REJECT;
+
key = key_to_tunnel_id(tpi->key, tpi->seq);
ovs_flow_tun_info_init(_info, ip_hdr(skb), 0, 0, key,
   filter_tnl_flags(tpi->flags), NULL, 0);
-- 
1.9.1


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html